Space-Time Prediction of Bike Share Demand

2025 Q1 VS 2024 Q4

Author

Tess Vu

Published

December 1, 2025

PART I: 2024 Q4 VS. 2025 Q1

1. Data Download

Indego Bikeshare Data. Quarter 4 was chosen for its presumed ridership stability in the colder winter months, as opposed to Q2 and Q3, which likely carry more variability from leisure ridership.

Iowa Environmental Mesonet (IEM) ASOS PHL Weather Station. Downloaded for the years aligning with the Indego data. One issue: the riem library would not download March 2024 specifically, for an unknown reason.
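One workaround for a single month that refuses to download is to request the data month by month and bind the results, so one failing request doesn't sink the whole pull. A sketch using riem's `riem_measures()` (the `PHL` station ID matches the data used here; the skip-on-error behavior is an assumption, not how the original download was done):

```r
# Month boundaries covering 2024 Q4 through 2025 Q1.
month_starts <- seq(as.Date("2024-10-01"), as.Date("2025-03-01"), by = "month")
month_ends   <- seq(as.Date("2024-11-01"), as.Date("2025-04-01"), by = "month")

weather_raw <- purrr::map2_dfr(
  as.character(month_starts), as.character(month_ends),
  function(start, end) {
    tryCatch(
      riem::riem_measures(station = "PHL", date_start = start, date_end = end),
      # Return an empty frame for months the service refuses,
      # instead of failing the whole download.
      error = function(e) dplyr::tibble()
    )
  }
)
```

Any month that errors out can then be retried on its own or patched in from a manual IEM download.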

Code
library(tidyverse)
library(lubridate)
library(janitor)
library(zoo)
library(sf)
library(tigris)
library(tidycensus)
library(viridis)
library(knitr)
library(kableExtra)
library(patchwork)
library(scales)
library(showtext)
library(sysfonts)
library(glmnet)
library(fixest)

# Avoid spherical issues with joins.
sf_use_s2(FALSE)

# Load fonts
font_add_google("Outfit", "outfit")
font_add_google("Anonymous Pro", "anonymous")
showtext_opts(dpi = 300)
showtext_auto()

# Get rid of scientific notation.
options(scipen = 999)
Code
# Create custom plot theme for charts.
theme_plot <- function(base_size = 23) {
  theme_minimal(base_size = base_size) +
    theme(
      # Set title styling with custom font and color.
      plot.title = element_text(face = "bold",
                                family = "outfit",
                                color = "#2d2a26",
                                size = base_size + 1,
                                hjust = 0.5
                                ),
      # Set subtitle with italic styling.
      plot.subtitle = element_text(face = "italic",
                                   family = "outfit",
                                   color = "#51534a",
                                   size = base_size - 1,
                                   hjust = 0.5,
                                   margin = margin(b = 0.5, unit = "cm")
                                   ),
      # Set caption styling for source notes.
      plot.caption = element_text(face = "italic",
                                  family = "anonymous",
                                  color = "#9b9e98",
                                  size = base_size - 2
                                  ),
      # Position legend at bottom.
      legend.position = "bottom",
      # Set grid line colors.
      panel.grid.major = element_line(colour = "#d4d2cd"),
      panel.grid.minor = element_line(colour = "#d4d2cd"),
      # Style axis text and titles.
      axis.text = element_text(face = "italic",
                               family = "anonymous",
                               size = base_size - 2,
                               hjust = 0.5
                               ),
      axis.title = element_text(face = "bold",
                                family = "anonymous",
                                size = base_size - 1,
                                hjust = 0.5
                                ),
      # Add spacing around axis titles.
      axis.title.y = element_text(margin = margin(r = 0.5, unit = "cm")
                                  ),
      axis.title.x = element_text(margin = margin(t = 0.5, unit = "cm")
                                  ),
      # Style legend elements.
      legend.title = element_text(face = "italic",
                                  family = "anonymous",
                                  size = base_size - 1,
                                  hjust = 0.5
                                  ),
      legend.title.position = "top",
      legend.text = element_text(face = "italic",
                                 family = "anonymous",
                                 size = base_size - 2,
                                 hjust = 0.5
                                 ),
      # Set legend key dimensions.
      legend.key.width = unit(2, "cm"),
      legend.key.height = unit(0.5, "cm"),
      # Set background colors for consistent appearance.
      legend.background = element_rect(fill = "#f5f4f0", color = "#f5f4f0"),
      plot.background = element_rect(fill = "#f5f4f0", color = "#f5f4f0"),
      panel.background = element_rect(fill = "#f5f4f0", color = "#f5f4f0"),
      # Add plot margins.
      plot.margin = unit(c(1, 1, 1, 1), "cm")
    )
  }

# Create custom theme for maps.
# Similar to plot theme but removes axis elements.
theme_map <- function(base_size = 17) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold",
                                family = "outfit",
                                color = "#2d2a26",
                                size = base_size + 1,
                                hjust = 0.5
                                ),
      plot.subtitle = element_text(face = "italic",
                                   family = "outfit",
                                   color = "#51534a",
                                   size = base_size - 1,
                                   hjust = 0.5,
                                   margin = margin(b = 0.5, unit = "cm")
                                   ),
      plot.caption = element_text(face = "italic",
                                  family = "anonymous",
                                  color = "#9b9e98",
                                  size = base_size - 3
                                  ),
      legend.position = "bottom",
      # Remove grid lines for maps.
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      # Remove axis text and titles for maps.
      axis.text = element_blank(),
      axis.title = element_blank(),
      legend.title = element_text(face = "italic",
                                  family = "anonymous",
                                  size = base_size - 1,
                                  hjust = 0.5
                                  ),
      legend.title.position = "top",
      legend.text = element_text(face = "italic",
                                 family = "anonymous",
                                 size = base_size - 3,
                                 hjust = 0.5
                                 ),
      legend.key.width = unit(2, "cm"),
      legend.key.height = unit(0.5, "cm"),
      legend.background = element_rect(fill = "#f5f4f0", color = "#f5f4f0"),
      plot.background = element_rect(fill = "#f5f4f0", color = "#f5f4f0"),
      panel.background = element_rect(fill = "#f5f4f0", color = "#f5f4f0"),
      plot.margin = unit(c(1, 1, 1, 1), "cm")
    )
  }
Code
# Load Census API key from environment.
census_api_key <- Sys.getenv("CENSUS_API_KEY")

i. Load and Clean Bike Share Data

Code
# Load raw bike share data from CSV.
bike_data <- read.csv("data/indego_2024_2025.csv")

# Define multiple date format patterns to handle inconsistent formatting.
date_formats <- c(
  "%m/%d/%Y %H:%M", # 1/1/2020 10:30
  "%Y-%m-%d %H:%M:%S", # 2020-01-01 10:30:00
  "%m/%d/%y %H:%M", # 1/1/20 10:30
  "%m/%d/%Y %I:%M:%S %p", # 1/1/2020 10:30:00 AM/PM
  "%m/%d/%Y %I:%M %p" # 1/1/2020 10:30 AM/PM
  )

# Parse dates and clean data.
bike_data <- bike_data %>%
  mutate(
    # Parse start and end times with multiple format options.
    start_datetime_new = parse_date_time(start_time, orders = date_formats),
    end_datetime_new = parse_date_time(end_time, orders = date_formats)
    ) %>%
  filter(
    # Remove rows with unparseable dates.
    !is.na(start_datetime_new),
    !is.na(end_datetime_new),
    # Remove trips without station IDs.
    !is.na(start_station_id),
    # Remove trips with zero or negative duration.
    duration > 0,
    # Filter to Philadelphia using bounding box.
    start_lon >= -75.30, start_lon <= -74.95,
    start_lat >= 39.85, start_lat <= 40.20
    ) %>%
  mutate(
    # Replace original datetime columns with cleaned versions.
    start_time = start_datetime_new,
    end_time = end_datetime_new,
    # Get date component for daily aggregation.
    date = as_date(start_time),
    # Get year for filtering.
    year = year(start_time),
    # Floor to the nearest 30 minutes.
    # Creates 30-min intervals.
    interval30 = floor_date(start_time, unit = "30 minutes"),
    # Create quarter-year label.
    quarter_year = paste0("Q", quarter(start_time), " ", year(start_time))
    ) %>%
  # Remove temporary datetime columns.
  select(-c(start_datetime_new, end_datetime_new))
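The multi-format fallback can be sanity-checked on a few synthetic timestamps, one per pattern (these example strings are illustrative, not from the Indego data):

```r
library(lubridate)

# Same format list as above.
date_formats <- c(
  "%m/%d/%Y %H:%M",
  "%Y-%m-%d %H:%M:%S",
  "%m/%d/%y %H:%M",
  "%m/%d/%Y %I:%M:%S %p",
  "%m/%d/%Y %I:%M %p"
)

samples <- c("1/1/2025 10:30", "2025-01-01 10:30:00", "1/1/25 10:30",
             "1/1/2025 10:30:00 PM", "1/1/2025 10:30 PM")

# Each string should match one of the supplied orders and parse without NA.
parsed <- parse_date_time(samples, orders = date_formats)
```

Any string that matches none of the orders comes back `NA`, which is exactly what the `!is.na()` filters downstream catch.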

ii. Subset Data by Time Period

Code
# Filter to 2025 Q1 data.
# Q1 includes January, February, March.
bike_q1 <- bike_data %>%
  filter(year == 2025, quarter(start_time) == 1)

# Print summary statistics for Q1.
cat("2025 Q1 Trips:", format(nrow(bike_q1), big.mark = ","), "\n")
2025 Q1 Trips: 201,134 
Code
cat("Date Range:", format(min(bike_q1$date), "%Y-%m-%d"), 
    "to", format(max(bike_q1$date), "%Y-%m-%d"), "\n\n")
Date Range: 2025-01-01 to 2025-03-31 
Code
cat("Start Stations:", length(unique(bike_q1$start_station)), "\n")
Start Stations: 264 
Code
# Trip type distribution.
table(bike_q1$trip_route_category)

   One Way Round Trip 
    190357      10777 
Code
# Passholder type distribution.
table(bike_q1$passholder_type)

  Day Pass   Indego30  Indego365 IndegoFlex    Walk-up 
      5500      93800      91423          3      10408 
Code
# Bike type distribution.
table(bike_q1$bike_type)

electric standard 
  129277    71857 
Code
# Filter to 2024 Q4 data.
# Q4 includes October, November, December.
bike_q4 <- bike_data %>%
  filter(year == 2024, quarter(start_time) == 4)

# Print summary statistics for Q4.
cat("2024 Q4 Trips:", format(nrow(bike_q4), big.mark = ","), "\n")
2024 Q4 Trips: 167,225 
Code
cat("Date Range:", format(min(bike_q4$date), "%Y-%m-%d"),
    "to", format(max(bike_q4$date), "%Y-%m-%d"), "\n")
Date Range: 2024-10-01 to 2024-12-31 
Code
cat("Start Stations:", length(unique(bike_q4$start_station)), "\n")
Start Stations: 255 
Code
# Trip type distribution.
table(bike_q4$trip_route_category)

   One Way Round Trip 
    157965       9260 
Code
# Passholder type distribution.
table(bike_q4$passholder_type)

  Day Pass   Indego30  Indego365 IndegoFlex       NULL    Walk-up 
      5962      88922      63552          1        454       8334 
Code
# Bike type distribution.
table(bike_q4$bike_type)

electric standard 
   98028    69197 

iii. Create 30-min Panel Data

Code
# Create function to aggregate trip-level data to a station x 30-min panel.
# Panel structure has one row per station-interval combination.
make_30min_panel <- function(data_frame) {
  data_frame %>%
    mutate(
      interval30 = floor_date(start_time, unit = "30 minutes"),
      time_slot_index = hour(interval30) * 2 + minute(interval30) / 30,
      time_continuous = hour(interval30) + minute(interval30) / 60,
      # Calculate index of 30-min slot within day.
      time_slot = as.factor(hour(interval30) * 2 + minute(interval30) / 30),
      # Get date for daily patterns.
      date = as_date(start_time),
      # Get day of week starting Monday as day 1.
      dow = wday(date, label = TRUE, week_start = 1),
      # Create weekend indicator.
      is_weekend = dow %in% c("Sat", "Sun"),
      # Get month for seasonal analysis.
      month = month(date),
      # Get year for year-over-year comparisons.
      year = year(date),
      # Get hour.
      hour = hour(start_time)
      ) %>%
    # Group by all time and space dimensions.
    group_by(
      start_station_id, start_lat, start_lon,
      interval30, date, year, month, hour, dow, is_weekend,
      time_slot, time_slot_index, time_continuous
      ) %>%
    # Count trips per group as dependent variable.
    summarize(trips = n(), .groups = "drop")
}
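The slot index used above maps each 30-minute interval to an integer 0–47 within the day. For example, with the same arithmetic as the function (the timestamp is hypothetical):

```r
library(lubridate)

t <- ymd_hm("2025-01-15 07:30", tz = "UTC")
interval30 <- floor_date(t, unit = "30 minutes")

# hour * 2 + minute / 30 gives the 0-47 slot index within the day.
slot <- hour(interval30) * 2 + minute(interval30) / 30
# 07:30 falls in slot 15, the 16th half-hour of the day.
```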
Code
# Create base panels.
panel_q1_base <- make_30min_panel(bike_q1)
panel_q4_base <- make_30min_panel(bike_q4)

# Panel structure.
cat("2025 Q1 Base Panel:", format(nrow(panel_q1_base), big.mark = ","), "rows\n")
2025 Q1 Base Panel: 146,569 rows
Code
cat("2024 Q4 Base Panel:", format(nrow(panel_q4_base), big.mark = ","), "rows\n")
2024 Q4 Base Panel: 112,732 rows

2. Trip Visualizations

i. Daily Patterns

Code
# Aggregate Q1 trips to daily.
daily_trips_q1 <- panel_q1_base %>%
  group_by(date) %>%
  summarize(trips = sum(trips), .groups = "drop")

# Create Q1 daily trips plot with trend line.
daily_trips_q1_plot <- ggplot(daily_trips_q1, aes(date, trips)) +
  # Plot raw daily trip counts as line.
  geom_line(color = "#778ac5", linewidth = 1) +
  # Add smoothed trend line to show overall pattern.
  geom_smooth(se = FALSE, color = "#ff4100", linetype = "dashed") +
  # Format y-axis with comma separators.
  scale_y_continuous(labels = comma) +
  labs(
    title = "Indego Daily Trips",
    subtitle = "2025 Q1",
    x = "Date", y = "Trips"
  ) +
  theme_plot()

# Aggregate Q4 trips to daily.
daily_trips_q4 <- panel_q4_base %>%
  group_by(date) %>%
  summarize(trips = sum(trips), .groups = "drop")

# Create Q4 daily trips plot with trend line.
daily_trips_q4_plot <- ggplot(daily_trips_q4, aes(date, trips)) +
  geom_line(color = "#778ac5", linewidth = 1) +
  geom_smooth(se = FALSE, color = "#ff4100", linetype = "dashed") +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Indego Daily Trips",
    subtitle = "2024 Q4",
    x = "Date", y = "Trips"
  ) +
  theme_plot()

# Stack plots vertically.
daily_trips_q1_plot / daily_trips_q4_plot

Ridership shows three significant dips in Q4, and surprisingly there is very low ridership at the start of October even though the end of September has no significant holidays. It may simply mark the point where rider behavior shifts with the changing season, or where earlier sunsets discourage evening riding.

November and December also show the expected dips around major American holidays like Thanksgiving and Christmas. The unexpectedly large dips at the very start of November and December have a simpler explanation: the 2024 calendar shows both months began on a weekend, and the drop follows because the majority of users are likely work commuters rather than leisure riders.

3. Census Data Integration

i. Load and Clean Census Data

Code
# Download 2023 ACS 5-year estimates for Philadelphia tracts.
# Suppress messages and warnings from tidycensus.
philly_census <- suppressMessages(suppressWarnings(get_acs(
  geography = "tract",
  variables = c(
    "B01003_001", # Total population.
    "B19013_001", # Median household income.
    "B08301_001", # Total commuters for denominator.
    "B08301_010", # Public transportation commuters.
    "B02001_002", # White alone population.
    "B25077_001", # Median home value.
    "B15003_022", # Bachelor's degree or higher.
    "B08201_002", # Households with no vehicle.
    "B01001_002", # Male population.
    "B01001_026", # Female population.
    "B15003_001", # Adult population 25+ for education rates.
    "B25003_001" # Total households for car ownership rates.
  ),
  state = "PA",
  county = "Philadelphia",
  year = 2023,
  # Include geometry for spatial joins.
  geometry = TRUE,
  output = "wide"
  ) %>%
  # Rename variables.
  rename(
    total_pop = B01003_001E,
    med_inc = B19013_001E,
    total_commute = B08301_001E,
    public_commute = B08301_010E,
    white_pop = B02001_002E,
    med_h_val = B25077_001E,
    bach_plus = B15003_022E,
    no_car = B08201_002E,
    male_pop = B01001_002E,
    female_pop = B01001_026E,
    adult_pop = B15003_001E,
    total_hh = B25003_001E
  ) %>%
  mutate(
    # Calculate percentage taking public transit.
    # Use pmax to avoid division by zero.
    pct_public_commute = public_commute / pmax(total_commute, 1),
    # Calculate percentage white for demographic context.
    pct_white = white_pop / pmax(total_pop, 1),
    # Calculate percentage with bachelor's or higher.
    pct_bach_plus = bach_plus / pmax(adult_pop, 1),
    # Calculate percentage of households without car.
    pct_no_car = no_car / pmax(total_hh, 1),
    # Calculate percentage male for demographic balance.
    pct_male = male_pop / pmax(total_pop, 1)
  ) %>%
  st_transform(4326)
  )
  )

# Census placeholder values that indicate missing data.
# Used instead of NA in some ACS tables.
bad_placeholders <- c(-666666666, -999999999, -6666666, -999999)

# Replace with NA values.
philly_census <- philly_census %>%
  mutate(across(
    # Apply to all numeric census variables.
    c(total_pop, med_inc, total_commute, public_commute, white_pop, med_h_val,
      bach_plus, no_car, male_pop, female_pop, adult_pop, total_hh), 
    # Replace bad values with NA.
    ~ case_when(.x %in% bad_placeholders ~ NA_real_, TRUE ~ .x)
    )
    )

# Verification.
cat("Philadelphia Census Tracts:", nrow(philly_census), "\n")

ii. Create Station-Census Lookup Table

Code
# Get unique station locations from Q4 data.
stations_geo <- bike_q4 %>%
  group_by(start_station_id) %>%
  summarize(
    # Take first coordinates.
    start_lat = first(start_lat),
    start_lon = first(start_lon),
    .groups = "drop"
    ) %>%
  # Remove stations missing coordinates.
  filter(!is.na(start_lat), !is.na(start_lon))

# Convert stations to sf object for spatial operations.
stations_sf <- stations_geo %>%
  st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326)

# Spatial join stations to census tracts.
station_census_raw <- suppressMessages(
  st_join(stations_sf, philly_census, join = st_intersects, left = TRUE)
  )

# Create lookup table with one row per station.
station_census_lookup <- station_census_raw %>%
  st_drop_geometry() %>%
  # Sort for consistent selection if multiple matches.
  arrange(start_station_id, med_inc) %>%
  # Group by station to handle any duplicates.
  group_by(start_station_id) %>%
  # Keep only first match per station.
  # Edge cases where station falls on tract boundary.
  slice(1) %>%
  ungroup() %>%
  # Only variables needed for modeling.
  select(start_station_id, med_inc, pct_public_commute, pct_white, total_pop,
         pct_bach_plus, pct_no_car, pct_male) %>%
  # Keep stations in tracts with reported income, a proxy for residential areas.
  filter(!is.na(med_inc))

# Extract list of stations with valid census data.
valid_stations <- station_census_lookup$start_station_id

# Add coordinates back for mapping.
station_census_lookup <- station_census_lookup %>%
  left_join(stations_geo, by = "start_station_id")

# Check for duplicate stations.
n_duplicates <- station_census_lookup %>%
  count(start_station_id) %>%
  filter(n > 1) %>%
  nrow()

# Print summary statistics.
cat("Valid Residential Stations:", length(valid_stations), "\n")
Valid Residential Stations: 235 
Code
cat("Duplicate Stations:", n_duplicates, "\n")
Duplicate Stations: 0 
Code
# Get distinct stations from Q4 for mapping.
stations_q4 <- bike_q4 %>%
  distinct(start_station_id, start_lat, start_lon) %>%
  filter(!is.na(start_lat), !is.na(start_lon))

# Map stations overlaid on income.
med_inc_station_map <- ggplot() +
  # Plot census tracts colored by median income.
  geom_sf(data = philly_census, aes(fill = med_inc), color = NA) +
  # direction = -1 reverses scale. Darker = higher income.
  scale_fill_viridis(
    option = "mako",
    direction = -1,
    name   = "Median Income",
    labels = dollar,
    na.value = "#9b9e98"
    ) +
  # Add station points on top.
  geom_point(
    data = stations_q4,
    aes(x = start_lon, y = start_lat),
    color = "#ff4100", size = 1, alpha = 0.8
    ) +
  labs(
    title = "Indego Stations",
    subtitle = "Philadelphia, PA",
    caption = "Color: Median household income by census tracts."
    ) +
  theme_map()

med_inc_station_map

iii. Map Stations and Census Context

Code
# Map with all stations and income.
med_inc_station_map <- ggplot() +
  geom_sf(data = philly_census, aes(fill = med_inc), color = NA) +
  scale_fill_viridis(
    option = "mako",
    direction = -1,
    name = "Median Income",
    labels = dollar,
    na.value = "#9b9e98"
    ) +
  geom_point(
    data = stations_geo,
    aes(x = start_lon, y = start_lat),
    color = "#ff4100", size = 1, alpha = 0.8
    ) +
  labs(
    title = "Indego Stations",
    subtitle = "Philadelphia, PA"
    ) +
  theme_map()

# Data showing which stations have census matches.
stations_for_map <- stations_geo %>%
  left_join(
    station_census_lookup %>% select(start_station_id, med_inc),
    by = "start_station_id"
    ) %>%
  # Flag whether station has census data.
  mutate(has_census = !is.na(med_inc))

# Map highlighting missing census matches.
missing_station_map <- ggplot() +
  geom_sf(data = philly_census, aes(fill = med_inc), color = NA) +
  scale_fill_viridis(
    option = "mako",
    direction = -1,
    name = "Median Income",
    labels = dollar,
    na.value = "#9b9e98"
    ) +
  # Plot stations with census data as gray dots.
  geom_point(
    data = stations_for_map %>% filter(has_census),
    aes(start_lon, start_lat),
    color = "#51534a", size = 1, alpha = 0.6
    ) +
  # Plot stations without census data as orange X marks.
  geom_point(
    data = stations_for_map %>% filter(!has_census),
    aes(start_lon, start_lat),
    color = "#ff4100", size = 1, shape = 4, stroke = 1
    ) +
  labs(
    title = "Indego Stations and Census Matches",
    subtitle = "Philadelphia, PA",
    caption = "Orange X: Stations without tract-level demographics.\nColor: Median household income by census tract."
    ) +
  theme_map()

# Combine maps side by side with shared legend.
(med_inc_station_map | missing_station_map) + 
  plot_layout(guides = "collect") & 
  theme(legend.position = "bottom",
        legend.direction = "horizontal")

Unfortunately this exploration is restricted to residential areas. While that narrows the scope, stations in non-residential areas may see larger ridership precisely because they sit in areas zoned as parks (Fairmount and FDR Parks) or business/commercial (Xfinity Mobile Arena). A significant share of the omitted stations also falls in University City, on the University of Pennsylvania’s campus.

4. Weather Data Integration

i. Load and Clean Weather Data

Code
# Load 30-min weather data from IEM ASOS station.
weather_data <- read.csv("data/PHL.csv") %>%
  # Select relevant weather variables.
  select(valid, tmpf, dwpf, relh, sknt, p01i, vsby, gust, wxcodes, feel) %>%
  rename(
    interval_raw = valid, 
    temp = tmpf,
    dew = dwpf,
    humid = relh,
    wind = sknt,
    precip = p01i,
    visibility = vsby,
    w_code = wxcodes
    )

# Replace string "null" with NA.
weather_data[weather_data == "null"] <- NA

# Clean and interpolate weather data.
weather_clean <- weather_data %>%
  mutate(
    # Parse datetime string to POSIXct object.
    interval30 = as.POSIXct(interval_raw, format = "%Y-%m-%d %H:%M", tz = "UTC"),
    # Round to exact hour for initial alignment.
    interval30 = floor_date(interval30, unit = "hour"), 
    # Convert measurement columns to numeric.
    temp = as.numeric(temp),
    precip = as.numeric(precip),
    dew = as.numeric(dew),
    humid = as.numeric(humid),
    wind = as.numeric(wind),
    visibility = as.numeric(visibility),
    feel = as.numeric(feel),
    gust = as.numeric(gust)
    ) %>%
  # Keep only unique time points.
  distinct(interval30, .keep_all = TRUE) %>%
  # Sort by time for interpolation.
  arrange(interval30) %>%
  mutate(
    # Interpolate sparse missing values (as before).
    temp = na.approx(temp, na.rm = FALSE),
    dew = na.approx(dew, na.rm = FALSE),
    precip = ifelse(is.na(precip), 0, precip),
    gust = ifelse(is.na(gust), 0, gust),
    w_code = case_when(is.na(w_code) | w_code == "" ~ "CLR", TRUE ~ w_code)
    ) %>%
  complete(interval30 = seq(min(interval30), max(interval30), by = "30 min")) %>%
  # Forward-fill the new :30 rows from the preceding hourly observation.
  fill(temp, dew, humid, wind, precip, feel, .direction = "down") 

# Verification.
cat("Weather:", format(nrow(weather_clean), big.mark = ","), "\n")
Weather: 190,987 
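The interpolate-then-fill pipeline can be illustrated on a tiny hourly series with one gap (the times and temperatures are made up):

```r
library(dplyr)
library(tidyr)
library(zoo)
library(lubridate)

toy <- tibble(
  interval30 = ymd_hm("2025-01-01 00:00", tz = "UTC") + hours(0:3),
  temp = c(30, NA, 34, 36)
)

toy_clean <- toy %>%
  # Linearly interpolate the interior gap (01:00 becomes 32).
  mutate(temp = na.approx(temp, na.rm = FALSE)) %>%
  # Insert the missing :30 rows.
  complete(interval30 = seq(min(interval30), max(interval30), by = "30 min")) %>%
  # Carry each :00 observation forward onto the following :30 row.
  fill(temp, .direction = "down")
```

Four hourly rows become seven half-hourly rows, with every new :30 row copying the preceding hour's value.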

ii. Weather Pattern Visualization

Code
# Filter weather to Q1 date range.
weather_q1 <- weather_clean %>%
  filter(
    interval30 >= min(panel_q1_base$interval30),
    interval30 <= max(panel_q1_base$interval30)
    )

# Create Q1 temperature plot.
weather_q1_plot <- ggplot(weather_q1, aes(interval30, temp)) +
  # Plot 30-min temperature as line.
  geom_line(color = "#00a557") +
  # Add smoothed trend.
  geom_smooth(se = FALSE, color = "#ff4100") +
  labs(
    title = "Philadelphia Temperature",
    subtitle = "2025 Q1",
    x = "Date", y = "Temperature (°F)"
    ) +
  theme_plot()

# Filter weather to Q4 date range.
weather_q4_daily <- weather_clean %>%
  filter(
    interval30 >= min(panel_q4_base$interval30),
    interval30 <= max(panel_q4_base$interval30)
    )

# Create Q4 temperature plot.
weather_q4_daily_plot <- ggplot(weather_q4_daily, aes(interval30, temp)) +
  geom_line(color = "#00a557") +
  geom_smooth(se = FALSE, color = "#ff4100") +
  labs(
    title = "Philadelphia Temperature",
    subtitle = "2024 Q4",
    x = "Date", y = "Temperature (°F)"
    ) +
  theme_plot()

# Aggregate to monthly averages for visualization.
weather_q4_monthly <- weather_clean %>%
  group_by(year = year(interval30), month = month(interval30)) %>%
  summarize(
    temp = mean(temp, na.rm = TRUE),
    date_month = make_date(year, month, 1),
    .groups = "drop"
    )

# Stack temperature plots vertically.
weather_q1_plot / weather_q4_daily_plot

Q1 and Q4 already show noticeable differences in their temperature patterns: Q4 declines steadily into the colder winter season, while Q1 slowly creeps toward the warmer spring.

5. Build Complete Space-Time Panel

i. 2025 Q1 Complete Panel

Code
# Filter to valid stations.
panel_q1_base <- panel_q1_base %>%
  filter(start_station_id %in% valid_stations)

# Get unique times and stations.
unique_times_q1 <- unique(panel_q1_base$interval30) %>% sort()
unique_stations_q1 <- valid_stations

# Verification.
cat("2025 Q1 Complete Panel:\n")
2025 Q1 Complete Panel:
Code
cat("Stations:", length(unique_stations_q1), "\n")
Stations: 235 
Code
cat("Time Periods:", format(length(unique_times_q1), big.mark = ","), "\n")
Time Periods: 4,228 
Code
cat("Expected Rows:", format(length(unique_stations_q1) * length(unique_times_q1), big.mark = ","))
Expected Rows: 993,580
Code
# Create complete space-time grid.
complete_grid_q1 <- expand.grid(
  interval30 = unique_times_q1,
  start_station_id = unique_stations_q1,
  # Don't keep attributes to reduce memory.
  KEEP.OUT.ATTRS = FALSE,
  # Don't convert to factors.
  stringsAsFactors = FALSE
  )

# Get temporal features from interval30.
complete_grid_q1 <- complete_grid_q1 %>%
  mutate(
    # Get date.
    date = as_date(interval30),
    # Get year.
    year = year(interval30),
    # Get month for seasonal patterns.
    month = month(interval30),
    # Get hour.
    hour = hour(interval30),
    # Get day of week starting Monday.
    dow = wday(date, label = TRUE, week_start = 1),
    # Continuous time.
    time_continuous = hour(interval30) + minute(interval30) / 60,
    # Add 30-min intervals.
    time_slot = as.factor(hour(interval30) * 2 + minute(interval30) / 30),
    # Create weekend binary.
    is_weekend = ifelse(dow %in% c("Sat", "Sun"), 1, 0),
    # Create rush hour indicator.
    # Morning slots 14-19 (7:00-9:30) and evening slots 32-37 (16:00-18:30).
    rush_hour = ifelse(time_slot %in% c(14:19, 32:37), 1, 0)
    )
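Slots 14–19 cover 7:00–9:30 and slots 32–37 cover 16:00–18:30, so the indicator can be verified on a few hypothetical timestamps:

```r
library(lubridate)

# Same slot arithmetic as the panel construction.
slot_of <- function(t) hour(t) * 2 + minute(t) / 30

rush_slots <- c(14:19, 32:37)

morning <- ymd_hm("2025-01-15 08:00", tz = "UTC")   # slot 16
evening <- ymd_hm("2025-01-15 17:30", tz = "UTC")   # slot 35
midday  <- ymd_hm("2025-01-15 12:00", tz = "UTC")   # slot 24
```

Both commute times fall inside `rush_slots`; noon does not.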
Code
# Merge observed trip counts onto complete grid.
panel_q1_with_trips <- complete_grid_q1 %>%
  left_join(
    # Only need station, time, and trip count.
    panel_q1_base %>% select(start_station_id, interval30, trips),
    by = c("start_station_id", "interval30")
    ) %>%
  # Fill missing trip counts with zero.
  # Missing means no trips in that hour.
  mutate(trips = replace_na(trips, 0))

# Add station-level demographic attributes.
panel_q1_with_station <- panel_q1_with_trips %>%
  left_join(station_census_lookup, by = "start_station_id")

# Add 30-min weather conditions.
panel_q1_with_weather <- panel_q1_with_station %>%
  left_join(weather_clean, by = "interval30")
Code
# Calculate temporal lag features.
# Sort by station and time for correct ordering.
panel_q1 <- panel_q1_with_weather %>%
  arrange(start_station_id, interval30) %>%
  # Group by station for station-specific lags.
  group_by(start_station_id) %>%
  mutate(
    lag_30min = lag(trips, 1),
    lag_1hr = lag(trips, 2),
    lag_1hr30min = lag(trips, 3),
    lag_2hr = lag(trips, 4),
    lag_2hr30min = lag(trips, 5),
    lag_3hr = lag(trips, 6),
    lag_3hr30min = lag(trips, 7),
    lag_4hr = lag(trips, 8),
    lag_4hr30min = lag(trips, 9),
    lag_5hr = lag(trips, 10),
    lag_5hr30min = lag(trips, 11),
    lag_6hr = lag(trips, 12),
    lag_6hr30min = lag(trips, 13),
    lag_7hr = lag(trips, 14),
    lag_7hr30min = lag(trips, 15),
    lag_8hr = lag(trips, 16),
    lag_8hr30min = lag(trips, 17),
    lag_9hr = lag(trips, 18),
    lag_9hr30min = lag(trips, 19),
    lag_10hr = lag(trips, 20),
    lag_10hr30min = lag(trips, 21),
    lag_11hr = lag(trips, 22),
    lag_11hr30min = lag(trips, 23),
    # Half-day lag captures diurnal cycle.
    lag_12hr = lag(trips, 24),
    # Full day lag captures day-to-day patterns.
    lag_1day = lag(trips, 48),
    # Rolling 7-day average captures weekly baseline.
    # 336 slots = 7 days * 48 half-hour slots.
    # align = "right" uses only current and past values.
    avg_7day = rollapply(trips, 336, mean, fill = NA, align = "right")
  ) %>%
  ungroup() %>%
  # Remove rows where 7-day average not yet calculable.
  # Removes first week of data per station.
  filter(!is.na(avg_7day))

# Verification.
cat("Final 2025 Q1 Panel:", format(nrow(panel_q1), big.mark = ","), "\n")
Final 2025 Q1 Panel: 914,855 
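The window of 336 corresponds to 7 days of 48 half-hour slots; the right-aligned behavior is the same as in this small example:

```r
library(zoo)

x <- 1:10

# Right-aligned rolling mean of width 3: each value averages
# the current observation and the two preceding ones, so the
# first two positions are NA.
rm3 <- rollapply(x, 3, mean, fill = NA, align = "right")
```

Position 3 averages 1:3 and position 10 averages 8:10, which is why the panel drops the first week per station, where the 336-slot window is not yet complete.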

ii. 2024 Q4 Complete Panel

Code
# Filter panel to stations with census data.
panel_q4_base <- panel_q4_base %>%
  filter(start_station_id %in% valid_stations)

# Extract unique dimensions.
unique_times_q4 <- unique(panel_q4_base$interval30) %>% sort()
unique_stations_q4 <- valid_stations

# Verification.
cat("2024 Q4 Complete Panel:\n")
2024 Q4 Complete Panel:
Code
cat("Stations:", length(unique_stations_q4), "\n")
Stations: 235 
Code
cat("Time Periods:", format(length(unique_times_q4), big.mark = ","), "\n")
Time Periods: 2,668 
Code
cat( "Expected Rows", format(length(unique_stations_q4) * length(unique_times_q4), big.mark = ","), "\n\n" )
Expected Rows 626,980 
Code
# Create complete space-time grid for Q4.
complete_grid_q4 <- expand.grid(
  interval30 = unique_times_q4,
  start_station_id = unique_stations_q4,
  KEEP.OUT.ATTRS = FALSE,
  stringsAsFactors = FALSE
)

# Get temporal features from datetime.
complete_grid_q4 <- complete_grid_q4 %>%
  mutate(
    # Get date.
    date = as_date(interval30),
    # Get year.
    year = year(interval30),
    # Get month for seasonal patterns.
    month = month(interval30),
    # Get hour.
    hour = hour(interval30),
    # Get day of week starting Monday.
    dow = wday(date, label = TRUE, week_start = 1),
    # Continuous time.
    time_continuous = hour(interval30) + minute(interval30) / 60,
    # Add 30-min intervals.
    time_slot = as.factor(hour(interval30) * 2 + minute(interval30) / 30),
    # Create weekend binary.
    is_weekend = ifelse(dow %in% c("Sat", "Sun"), 1, 0),
    # Create rush hour indicator.
    # Morning slots 14-19 (7:00-9:30) and evening slots 32-37 (16:00-18:30).
    rush_hour = ifelse(time_slot %in% c(14:19, 32:37), 1, 0)
  )
Code
# Merge trip counts onto complete grid.
panel_q4_with_trips <- complete_grid_q4 %>%
  left_join(
    panel_q4_base %>% select(start_station_id, interval30, trips),
    by = c("start_station_id", "interval30")
  ) %>%
  # Fill zero for hours with no observed trips.
  mutate(trips = replace_na(trips, 0))

# Add station demographic attributes.
panel_q4_with_station <- panel_q4_with_trips %>%
  left_join(station_census_lookup, by = "start_station_id")

# Add 30-min weather conditions.
panel_q4_with_weather <- panel_q4_with_station %>%
  left_join(weather_clean, by = "interval30")
Code
# Calculate temporal lag features for Q4.
panel_q4 <- panel_q4_with_weather %>%
  arrange(start_station_id, interval30) %>%
  group_by(start_station_id) %>%
  mutate(
    lag_30min = lag(trips, 1),
    lag_1hr = lag(trips, 2),
    lag_1hr30min = lag(trips, 3),
    lag_2hr = lag(trips, 4),
    lag_2hr30min = lag(trips, 5),
    lag_3hr = lag(trips, 6),
    lag_3hr30min = lag(trips, 7),
    lag_4hr = lag(trips, 8),
    lag_4hr30min = lag(trips, 9),
    lag_5hr = lag(trips, 10),
    lag_5hr30min = lag(trips, 11),
    lag_6hr = lag(trips, 12),
    lag_6hr30min = lag(trips, 13),
    lag_7hr = lag(trips, 14),
    lag_7hr30min = lag(trips, 15),
    lag_8hr = lag(trips, 16),
    lag_8hr30min = lag(trips, 17),
    lag_9hr = lag(trips, 18),
    lag_9hr30min = lag(trips, 19),
    lag_10hr = lag(trips, 20),
    lag_10hr30min = lag(trips, 21),
    lag_11hr = lag(trips, 22),
    lag_11hr30min = lag(trips, 23),
    # Half-day lag captures diurnal cycle.
    lag_12hr = lag(trips, 24),
    # Full day lag captures day-to-day patterns.
    lag_1day = lag(trips, 48),
    # Rolling 7-day average captures weekly baseline.
    # 336 intervals = 7 days * 48 half-hour intervals.
    # align = "right" uses only current and past intervals.
    avg_7day = rollapply(trips, 336, mean, fill = NA, align = "right")
  ) %>%
  ungroup() %>%
  # Remove rows where weekly average not calculable.
  filter(!is.na(avg_7day))

# Verification.
cat("Final 2024 Q4 Panel:", format(nrow(panel_q4), big.mark = ","), "\n")
Final 2024 Q4 Panel: 548,255 
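`avg_7day` is a trailing (right-aligned) window: each value averages only the current and preceding 335 intervals, so no future information leaks into the feature. A dependency-free sketch of the same behavior using `stats::filter` in place of `zoo::rollapply`:

```r
# Trailing rolling mean: position t averages observations t-k+1 .. t.
x <- 1:10
k <- 4
trailing_mean <- as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))

trailing_mean
# NA NA NA 2.5 3.5 4.5 5.5 6.5 7.5 8.5
```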

6. Visualize Temporal Patterns

i. Lag Plots

Code
# Station 3010 is 15th & Spruce in Center City.
lag_data_long_q4 <- panel_q4 %>%
  filter(start_station_id == 3010) %>%
  # Take first week (336 30-min intervals).
  head(336) %>%
  # Select current trips and all lag variables.
  select(interval30, current_trips = trips, lag_1hr, lag_2hr, lag_3hr, lag_12hr, lag_1day) %>%
  # Reshape to long format for faceting.
  pivot_longer(
    cols = starts_with("lag"),
    names_to = "lag_type",
    values_to = "lag_value"
    ) %>%
  mutate(lag_type = factor(
    lag_type,
    levels = c("lag_1hr", "lag_2hr", "lag_3hr", "lag_12hr", "lag_1day")
  ))

# Create faceted lag comparison plot for Q4.
lag_facet_plot_q4 <- ggplot(lag_data_long_q4, aes(x = interval30)) +
  # Plot current demand.
  geom_line(aes(y = current_trips, color = "Current Demand"), linewidth = 0.75, alpha = 0.8) +
  # Plot lagged demand.
  geom_line(aes(y = lag_value, color = "Lagged Demand"), linewidth = 0.75, linetype = "dashed") +
  # Separate panel for each lag period.
  facet_wrap(~ lag_type, scales = "free_x", ncol = 2) +
  # Define colors for two series.
  scale_color_manual(
    name = NULL,
    values = c("Current Demand" = "#4A6F53", "Lagged Demand" = "#f48f33")
    ) +
  labs(
    title = "One Week Short-Term vs. Daily Temporal Lags",
    subtitle = "2024 Q4",
    caption = "Station 3010: 15th & Spruce",
    x = "Date and Time",
    y = "Trip Count"
    ) +
  theme_plot() +
  theme(strip.text = element_text(family = "anonymous", face = "bold", size = 20))

# Repeat for Q1 data.
lag_data_long_q1 <- panel_q1 %>%
  filter(start_station_id == 3010) %>%
  # Take first week (336 30-min intervals).
  head(336) %>%
  select(interval30, current_trips = trips, lag_1hr, lag_2hr, lag_3hr, lag_12hr, lag_1day) %>%
  pivot_longer(
    cols = starts_with("lag"),
    names_to = "lag_type",
    values_to = "lag_value"
  ) %>%
  mutate(lag_type = factor(
    lag_type,
    levels = c("lag_1hr", "lag_2hr", "lag_3hr", "lag_12hr", "lag_1day")
  ))

# Create faceted lag comparison plot for Q1.
lag_facet_plot_q1 <- ggplot(lag_data_long_q1, aes(x = interval30)) +
  geom_line(aes(y = current_trips, color = "Current Demand"), linewidth = 0.75, alpha = 0.8) +
  geom_line(aes(y = lag_value, color = "Lagged Demand"), linewidth = 0.75, linetype = "dashed") +
  facet_wrap(~ lag_type, scales = "free_x", ncol = 2) +
  scale_color_manual(
    name = NULL,
    values = c("Current Demand" = "#4A6F53", "Lagged Demand" = "#f48f33")
  ) +
  labs(
    title = "One Week Short-Term vs. Daily Temporal Lags",
    subtitle = "2025 Q1",
    x = "Date and Time",
    y = "Trip Count"
  ) +
  theme_plot() +
  theme(strip.text = element_text(family = "anonymous", face = "bold", size = 20))

# Stack plots vertically with shared legend.
(lag_facet_plot_q1 / lag_facet_plot_q4) +
  plot_layout(guides = "collect") & 
  theme(legend.position = "bottom",
        legend.direction = "horizontal")

This is the lag plot for the most popular Indego station, 15th & Spruce in Center City, four or five blocks south of City Hall. The short lags track current demand closely, the 12-hour lag still preserves much of that structure, and by 24 hours the relationship is noticeably noisier. In effect, Tobler's First Law carries over to time: nearer intervals are more related than distant ones.
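The decay of structure with lag distance can be made concrete with a toy autocorrelated series (simulated data, not the Indego panel): for a process without a strong daily cycle, correlation at a 1-hour lag far exceeds correlation at a 1-day lag.

```r
set.seed(42)
# AR(1) series at 30-min resolution: each interval keeps 80% of the last.
n <- 2000
x <- numeric(n)
for (t in 2:n) x[t] <- 0.8 * x[t - 1] + rnorm(1)

# Correlation of the series with its own lags, like lag_1hr vs. lag_1day.
lag_cor <- function(x, k) cor(x[-seq_len(k)], x[seq_len(length(x) - k)])
cor_1hr  <- lag_cor(x, 2)   # 2 half-hour steps
cor_1day <- lag_cor(x, 48)  # 48 half-hour steps

c(cor_1hr, cor_1day)  # the nearer lag is far more strongly correlated
```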

ii. 30-min Ridership Patterns

Code
# Calculate average trips by 30-min interval and day type for Q4.
min_30patterns_q4 <- panel_q4 %>%
  group_by(time_continuous, is_weekend) %>%
  summarize(avg_trips = mean(trips, na.rm = TRUE), .groups = "drop") %>%
  mutate(day_type = ifelse(is_weekend, "Weekend", "Weekday"))

# Create 30-min pattern plot for Q4.
min_30patterns_q4_plot <- ggplot(min_30patterns_q4, aes(x = time_continuous, y = avg_trips, color = day_type)) +
  # Plot as line to show continuous time pattern.
  geom_line(linewidth = 1.5) +
  # Use distinct colors for weekday vs weekend.
  scale_color_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
  labs(
    title = "Average 30-min Ridership Patterns",
    subtitle = "2024 Q4",
    x = "Time of Day (Hour)",
    y = "Average Trips per 30-Min",
    color = "Day Type"
    ) +
  theme_plot()

# Calculate average trips by 30-min interval and day type for Q1.
min_30patterns_q1 <- panel_q1 %>%
  group_by(time_continuous, is_weekend) %>%
  summarize(avg_trips = mean(trips, na.rm = TRUE), .groups = "drop") %>%
  mutate(day_type = ifelse(is_weekend, "Weekend", "Weekday"))

# Create 30-min pattern plot for Q1.
min_30patterns_q1_plot <- ggplot(min_30patterns_q1, aes(x = time_continuous, y = avg_trips, color = day_type)) +
  geom_line(linewidth = 1.5) +
  scale_color_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
  labs(
    title = "Average 30-min Ridership Patterns",
    subtitle = "2025 Q1",
    caption = "Average trips per station",
    x = "Time of Day (Hour)",
    y = "Average Trips per 30-Min",
    color = "Day Type"
    ) +
  theme_plot()

# Stack plots vertically.
min_30patterns_q4_plot / min_30patterns_q1_plot

The peak rush hours hold up across both quarters, despite their different points in the cold season (late-fall Q4 vs. winter-into-early-spring Q1).
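One way to check peak stability is to locate each quarter's highest-demand interval and compare. A simulated sketch of that check (two quarters sharing commute peaks but differing in overall demand level; all values synthetic):

```r
set.seed(1)
# 60 days of half-hour intervals with AM (8:00) and PM (17:00) peaks.
times <- rep(seq(0, 23.5, by = 0.5), times = 60)
shape <- function(t) dnorm(t, 8, 1) + dnorm(t, 17, 1)
q4 <- 5 * shape(times) + runif(length(times), 0, 0.2)  # lower-demand quarter
q1 <- 8 * shape(times) + runif(length(times), 0, 0.2)  # higher-demand quarter

# Average demand per time of day, then find each afternoon peak.
avg_q4 <- tapply(q4, times, mean)
avg_q1 <- tapply(q1, times, mean)
tod <- as.numeric(names(avg_q4))
pm_peak_q4 <- tod[tod >= 12][which.max(avg_q4[tod >= 12])]
pm_peak_q1 <- tod[tod >= 12][which.max(avg_q1[tod >= 12])]

c(pm_peak_q4, pm_peak_q1)  # both peaks land on the same 17:00 interval
```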

iii. Top Stations Table

Code
# Count total trips by station for Q4.
top_stations <- bike_q4 %>%
  count(start_station_id, start_lat, start_lon, name = "trips") %>%
  # Sort from highest to lowest.
  arrange(desc(trips)) %>%
  # Keep only the 20 busiest stations to match the table caption.
  slice_head(n = 20)

# Create formatted table.
top_stations_kable <- top_stations %>%
  mutate(
    # Add comma separators to trip counts.
    trips = scales::comma(trips)
    ) %>%
  # Generate kable table.
  kable(
    caption = "Top 20 Indego Stations by Trip Origins (2024 Q4)",
    col.names = c("Station ID", "Latitude", "Longitude", "Total Trips"),
    align = c("c", "c", "c", "r")
    ) %>%
  # Add styling.
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
    ) %>%
  # Add explanatory footnote.
  footnote(
    general = "Total trips originating from the station during 2024 Q4 period.",
    general_title = "Note:",
    footnote_as_chunk = FALSE
    )

top_stations_kable
Top 20 Indego Stations by Trip Origins (2024 Q4)
Station ID Latitude Longitude Total Trips
3010 39.94711 -75.16618 3,412
3032 39.94527 -75.17971 2,508
3359 39.94888 -75.16978 2,229
3244 39.93865 -75.16674 2,012
3295 39.95028 -75.16027 1,892
3028 39.94061 -75.14958 1,885
3020 39.94855 -75.19007 1,882
3066 39.94561 -75.17348 1,868
3054 39.96250 -75.17420 1,856
3208 39.95048 -75.19324 1,853
3101 39.94295 -75.15955 1,808
3362 39.94816 -75.16226 1,734
3022 39.95472 -75.18323 1,723
3012 39.94218 -75.17747 1,669
3185 39.95169 -75.15888 1,663
3256 39.95269 -75.17779 1,659
3059 39.96244 -75.16121 1,657
3063 39.94633 -75.16980 1,608
3007 39.94517 -75.15993 1,589
3061 39.95425 -75.17761 1,577
Note:
Total trips originating from the station during 2024 Q4 period.

7. Train/Test Split

i. 2024 Q4 Data Split

Code
# Add week number to panel.
# week() counts completed 7-day blocks since January 1.
panel_q4 <- panel_q4 %>%
  mutate(week = as.numeric(week(date)))

# Set split point at Week 49 (early December).
# Gives about six weeks (mid-October through November) for training.
split_date <- 49

# Find stations present in early period.
early_stations <- panel_q4 %>%
  filter(week < split_date) %>%
  distinct(start_station_id) %>%
  pull(start_station_id)

# Find stations present in late period.
late_stations <- panel_q4 %>%
  filter(week >= split_date) %>%
  distinct(start_station_id) %>%
  pull(start_station_id)

# Keep only stations present in both periods.
# This ensures we can make predictions for test set.
common_stations_q4 <- intersect(early_stations, late_stations)

# Print station counts.
cat("Early Stations (Weeks < 49):", length(early_stations), "\n")
Early Stations (Weeks < 49): 235 
Code
cat("Late Stations (Weeks >= 49):", length(late_stations), "\n")
Late Stations (Weeks >= 49): 235 
Code
cat("Both Period Stations:", length(common_stations_q4), "\n")
Both Period Stations: 235 
Code
# Create training set from early weeks.
train <- panel_q4 %>%
  filter(start_station_id %in% common_stations_q4, week < split_date)

# Create test set from late weeks.
test <- panel_q4 %>%
  filter(start_station_id %in% common_stations_q4, week >= split_date)

# Print split summary.
cat("\n2024 Q4 Training Observations:", format(nrow(train), big.mark = ","), "\n")

2024 Q4 Training Observations: 338,870 
Code
cat("2024 Q4 Testing Observations:", format(nrow(test), big.mark = ","), "\n")
2024 Q4 Testing Observations: 209,385 
Code
cat("2024 Q4 Training Date Range:", format(min(train$date)), "to", format(max(train$date)), "\n")
2024 Q4 Training Date Range: 2024-10-19 to 2024-12-01 
Code
cat("2024 Q4 Testing Date Range:", format(min(test$date)), "to", format(max(test$date)), "\n")
2024 Q4 Testing Date Range: 2024-12-13 to 2024-12-31 
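The week cutoff makes the split strictly chronological, so training never sees the test horizon. A self-contained base-R sketch of the same cutoff logic (`week_of` replicates `lubridate::week`'s completed-7-day-block counting; the trip counts are toy Poisson draws):

```r
set.seed(2)
# lubridate::week() counts completed 7-day blocks since Jan 1; as.POSIXlt's
# yday is 0-based, so integer-divide by 7 and add 1.
week_of <- function(d) (as.POSIXlt(d)$yday %/% 7) + 1

dates <- seq(as.Date("2024-10-01"), as.Date("2024-12-31"), by = "day")
toy <- data.frame(date = dates, trips = rpois(length(dates), 10))
toy$week <- week_of(toy$date)

train_toy <- toy[toy$week <  49, ]
test_toy  <- toy[toy$week >= 49, ]

range(train_toy$date)  # ends 2024-12-01: week 49 starts on 2024-12-02
```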

ii. 2025 Q1 Data Split

Code
# Add week number to Q1 panel.
panel_q1 <- panel_q1 %>%
  mutate(week = as.numeric(week(date)))

# Set split at Week 10 (early March).
min_week <- 10

# Find stations in early Q1 period.
early_stations_q1 <- panel_q1 %>%
  filter(week < min_week) %>%
  distinct(start_station_id) %>%
  pull(start_station_id)

# Find stations in late Q1 period.
late_stations_q1 <- panel_q1 %>%
  filter(week >= min_week) %>%
  distinct(start_station_id) %>%
  pull(start_station_id)

# Keep only stations in both periods.
common_stations_q1 <- intersect(early_stations_q1, late_stations_q1)

# Print station counts.
cat("Early Stations (Weeks < 10):", length(early_stations_q1), "\n")
Early Stations (Weeks < 10): 235 
Code
cat("Late Stations (Weeks >= 10):", length(late_stations_q1), "\n")
Late Stations (Weeks >= 10): 235 
Code
cat("Both Period Stations:", length(common_stations_q1), "\n")
Both Period Stations: 235 
Code
# Create training set.
train_q1 <- panel_q1 %>%
  filter(start_station_id %in% common_stations_q1, week < min_week)

# Create test set.
test_q1 <- panel_q1 %>%
  filter(start_station_id %in% common_stations_q1, week >= min_week)

# Print split summary.
cat("\n2025 Q1 Training Observations:", format(nrow(train_q1), big.mark = ","), "\n")

2025 Q1 Training Observations: 614,290 
Code
cat("2025 Q1 Testing Observations:", format(nrow(test_q1), big.mark = ","), "\n")
2025 Q1 Testing Observations: 300,565 
Code
cat("2025 Q1 Training Date Range:", format(min(train_q1$date)), "to", format(max(train_q1$date)), "\n")
2025 Q1 Training Date Range: 2025-01-08 to 2025-03-04 
Code
cat("2025 Q1 Testing Date Range:", format(min(test_q1$date)), "to", format(max(test_q1$date)), "\n")
2025 Q1 Testing Date Range: 2025-03-05 to 2025-03-31 

8. Modeling

i. Preparation and Helper Function

Code
# Define the outcome and predictor variables used across models.
# Rows missing any of these are dropped below.
model_predictors <- c(
  "trips", "hour", "dow", "temp", "precip",
  "lag_1hr", "lag_3hr", "lag_1day",
  "med_inc", "pct_public_commute", "pct_white",
  "start_station_id", "feel", "gust", "w_code",
  "rush_hour", "is_weekend"
)

# Define ordered day of week levels for factor.
dow_levels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# Clean Q4 training data.
train_clean <- train %>%
  # Create factor variable for day of week.
  mutate(dow_simple = factor(dow, levels = dow_levels)) %>%
  # Remove rows with any missing predictors.
  drop_na(all_of(model_predictors))

# Set contrast coding for day of week.
# Treatment coding uses first level as reference.
contrasts(train_clean$dow_simple) <- contr.treatment(7)

# Clean Q4 test data.
test_clean <- test %>%
  mutate(dow_simple = factor(dow, levels = dow_levels)) %>%
  drop_na(all_of(model_predictors))

# Clean Q1 training data.
train_q1_clean <- train_q1 %>%
  mutate(dow_simple = factor(dow, levels = dow_levels)) %>%
  drop_na(all_of(model_predictors))

contrasts(train_q1_clean$dow_simple) <- contr.treatment(7)

# Clean Q1 test data.
test_q1_clean <- test_q1 %>%
  mutate(dow_simple = factor(dow, levels = dow_levels)) %>%
  drop_na(all_of(model_predictors))

# Verification.
cat("\nFinal Clean Dataset Dimensions:\n")

Final Clean Dataset Dimensions:
Code
cat("2024 Q4 Train:", dim(train_clean), "\n")
2024 Q4 Train: 168965 59 
Code
cat("2024 Q4 Test:", dim(test_clean), "\n")
2024 Q4 Test: 104575 59 
Code
cat("2025 Q1 Train:", dim(train_q1_clean), "\n")
2025 Q1 Train: 306675 59 
Code
cat("2025 Q1 Test:", dim(test_q1_clean), "\n")
2025 Q1 Test: 149225 59 
Code
# Create helper function to extract model metrics.
get_metrics <- function(model, model_name) {
  # Get model summary object.
  model_summary <- summary(model)

  # Extract adjusted R-Squared.
  # Adjusted R-Squared penalizes for number of predictors.
  adj_r_squared <- model_summary$adj.r.squared

  # Calculate AIC.
  # Lower AIC indicates better model fit.
  aic <- AIC(model)

  # Store results in tibble for easy comparison.
  metrics <- tibble(
    model = model_name,
    adjusted_r_squared = adj_r_squared,
    aic = aic
  )
  return(metrics)
}
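A usage sketch of the `get_metrics()` pattern on a toy regression (`data.frame` stands in for `tibble` so the snippet needs no packages; the data are simulated):

```r
# Same shape as get_metrics() above, with data.frame in place of tibble.
get_metrics_base <- function(model, model_name) {
  s <- summary(model)
  data.frame(model = model_name,
             adjusted_r_squared = s$adj.r.squared,
             aic = AIC(model))
}

set.seed(7)
df <- data.frame(x = 1:100)
df$y <- 2 * df$x + rnorm(100, sd = 5)

m <- lm(y ~ x, data = df)
get_metrics_base(m, "toy model")  # adjusted R-squared near 1 for this fit
```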

ii. 2024 Q4 Models

Code
# Fit model.
model1_q4 <- lm(
  trips ~ as.factor(hour) + dow_simple + temp + precip,
  data = train_clean
)

summary(model1_q4)

Call:
lm(formula = trips ~ as.factor(hour) + dow_simple + temp + precip, 
    data = train_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8076 -0.3681 -0.1740 -0.0065 17.6681 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)       -0.0436871  0.0136452  -3.202              0.00137 ** 
as.factor(hour)1  -0.0628116  0.0116820  -5.377      0.0000000759352 ***
as.factor(hour)2  -0.1166714  0.0116800  -9.989 < 0.0000000000000002 ***
as.factor(hour)3  -0.1622147  0.0116861 -13.881 < 0.0000000000000002 ***
as.factor(hour)4  -0.1991111  0.0117929 -16.884 < 0.0000000000000002 ***
as.factor(hour)5  -0.2501664  0.0118112 -21.181 < 0.0000000000000002 ***
as.factor(hour)6  -0.2722295  0.0118928 -22.890 < 0.0000000000000002 ***
as.factor(hour)7  -0.2960674  0.0117879 -25.116 < 0.0000000000000002 ***
as.factor(hour)8  -0.3343246  0.0118805 -28.141 < 0.0000000000000002 ***
as.factor(hour)9  -0.3477408  0.0119001 -29.222 < 0.0000000000000002 ***
as.factor(hour)10 -0.3001567  0.0119457 -25.127 < 0.0000000000000002 ***
as.factor(hour)11 -0.1999966  0.0119981 -16.669 < 0.0000000000000002 ***
as.factor(hour)12 -0.0607885  0.0119225  -5.099      0.0000003424779 ***
as.factor(hour)13 -0.0050487  0.0119608  -0.422              0.67295    
as.factor(hour)14 -0.0720998  0.0119794  -6.019      0.0000000017626 ***
as.factor(hour)15 -0.1223295  0.0119652 -10.224 < 0.0000000000000002 ***
as.factor(hour)16 -0.0786620  0.0119317  -6.593      0.0000000000433 ***
as.factor(hour)17 -0.0197096  0.0118798  -1.659              0.09710 .  
as.factor(hour)18 -0.0003060  0.0118360  -0.026              0.97938    
as.factor(hour)19  0.0497192  0.0118222   4.206      0.0000260556096 ***
as.factor(hour)20  0.1341686  0.0118038  11.367 < 0.0000000000000002 ***
as.factor(hour)21  0.2945753  0.0117953  24.974 < 0.0000000000000002 ***
as.factor(hour)22  0.2795321  0.0117877  23.714 < 0.0000000000000002 ***
as.factor(hour)23  0.1102304  0.0117815   9.356 < 0.0000000000000002 ***
dow_simple2        0.0117222  0.0066564   1.761              0.07823 .  
dow_simple3        0.0330705  0.0063555   5.203      0.0000001958577 ***
dow_simple4       -0.0107866  0.0063792  -1.691              0.09086 .  
dow_simple5       -0.0017493  0.0067712  -0.258              0.79614    
dow_simple6       -0.0039041  0.0068005  -0.574              0.56590    
dow_simple7        0.0201326  0.0065761   3.061              0.00220 ** 
temp               0.0077745  0.0001924  40.417 < 0.0000000000000002 ***
precip             0.7210947  0.2266697   3.181              0.00147 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7046 on 168933 degrees of freedom
Multiple R-squared:  0.07442,   Adjusted R-squared:  0.07425 
F-statistic: 438.1 on 31 and 168933 DF,  p-value: < 0.00000000000000022
Code
# Get metrics.
metrics1_q4 <- get_metrics(model1_q4, "Model 1 2024 Q4")

# Predict test set and calculate MAE.
test_clean$pred1 <- predict(model1_q4, newdata = test_clean)
mae_value <- mean(abs(test_clean$trips - test_clean$pred1), na.rm = TRUE)

# Add MAE to metrics table.
metrics1_q4 <- metrics1_q4 %>% mutate(mae = mae_value)
Code
# Fit model.
model2_q4 <- lm(
  trips ~ as.factor(hour) + dow_simple + temp + precip +
    lag_1hr + lag_3hr + lag_1day,
  data = train_clean
)

summary(model2_q4)

Call:
lm(formula = trips ~ as.factor(hour) + dow_simple + temp + precip + 
    lag_1hr + lag_3hr + lag_1day, data = train_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2509 -0.2825 -0.1115  0.0033 17.2623 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)       -0.0884581  0.0128210  -6.899   0.0000000000052375 ***
as.factor(hour)1  -0.0271356  0.0109701  -2.474              0.01338 *  
as.factor(hour)2  -0.0396952  0.0109795  -3.615              0.00030 ***
as.factor(hour)3  -0.0565495  0.0110010  -5.140   0.0000002744385763 ***
as.factor(hour)4  -0.0717866  0.0111148  -6.459   0.0000000001059271 ***
as.factor(hour)5  -0.0974989  0.0111479  -8.746 < 0.0000000000000002 ***
as.factor(hour)6  -0.1002481  0.0112395  -8.919 < 0.0000000000000002 ***
as.factor(hour)7  -0.1087255  0.0111559  -9.746 < 0.0000000000000002 ***
as.factor(hour)8  -0.1249132  0.0112671 -11.087 < 0.0000000000000002 ***
as.factor(hour)9  -0.1251121  0.0113033 -11.069 < 0.0000000000000002 ***
as.factor(hour)10 -0.0816446  0.0113541  -7.191   0.0000000000006469 ***
as.factor(hour)11 -0.0090130  0.0114000  -0.791              0.42917    
as.factor(hour)12  0.0829187  0.0113196   7.325   0.0000000000002395 ***
as.factor(hour)13  0.0898649  0.0113322   7.930   0.0000000000000022 ***
as.factor(hour)14  0.0108794  0.0113158   0.961              0.33633    
as.factor(hour)15 -0.0312387  0.0112705  -2.772              0.00558 ** 
as.factor(hour)16  0.0062357  0.0112307   0.555              0.57873    
as.factor(hour)17  0.0508629  0.0111873   4.546   0.0000054590118127 ***
as.factor(hour)18  0.0576959  0.0111509   5.174   0.0000002292760395 ***
as.factor(hour)19  0.0926763  0.0111323   8.325 < 0.0000000000000002 ***
as.factor(hour)20  0.1467359  0.0111097  13.208 < 0.0000000000000002 ***
as.factor(hour)21  0.2618405  0.0111162  23.555 < 0.0000000000000002 ***
as.factor(hour)22  0.2111317  0.0111109  19.002 < 0.0000000000000002 ***
as.factor(hour)23  0.0681901  0.0110798   6.154   0.0000000007550539 ***
dow_simple2        0.0078849  0.0062504   1.262              0.20713    
dow_simple3        0.0097845  0.0059694   1.639              0.10119    
dow_simple4       -0.0239921  0.0059901  -4.005   0.0000619661205531 ***
dow_simple5       -0.0098509  0.0063580  -1.549              0.12129    
dow_simple6       -0.0046179  0.0063856  -0.723              0.46957    
dow_simple7        0.0079502  0.0061750   1.287              0.19792    
temp               0.0043248  0.0001822  23.733 < 0.0000000000000002 ***
precip             0.4504284  0.2128073   2.117              0.03430 *  
lag_1hr            0.1781513  0.0023805  74.839 < 0.0000000000000002 ***
lag_3hr            0.0968170  0.0023439  41.306 < 0.0000000000000002 ***
lag_1day           0.2131206  0.0023264  91.608 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6614 on 168930 degrees of freedom
Multiple R-squared:  0.1843,    Adjusted R-squared:  0.1841 
F-statistic:  1122 on 34 and 168930 DF,  p-value: < 0.00000000000000022
Code
# Get metrics.
metrics2_q4 <- get_metrics(model2_q4, "Model 2 2024 Q4")

# Predict test set and calculate MAE.
test_clean$pred2 <- predict(model2_q4, newdata = test_clean)
mae_value <- mean(abs(test_clean$trips - test_clean$pred2), na.rm = TRUE)

# Add MAE to metrics table.
metrics2_q4 <- metrics2_q4 %>% mutate(mae = mae_value)
Code
# Fit model.
model3_q4 <- lm(
  trips ~ as.factor(hour) + dow_simple + temp + precip +
    lag_1hr + lag_3hr + lag_1day +
    med_inc + pct_public_commute + pct_white,
  data = train_clean
)

summary(model3_q4)

Call:
lm(formula = trips ~ as.factor(hour) + dow_simple + temp + precip + 
    lag_1hr + lag_3hr + lag_1day + med_inc + pct_public_commute + 
    pct_white, data = train_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0618 -0.2910 -0.1133  0.0297 17.2470 

Coefficients:
                        Estimate    Std. Error t value             Pr(>|t|)    
(Intercept)        -0.1980437235  0.0144754779 -13.681 < 0.0000000000000002 ***
as.factor(hour)1   -0.0288322682  0.0109317650  -2.637             0.008353 ** 
as.factor(hour)2   -0.0440447394  0.0109417863  -4.025  0.00005691055369546 ***
as.factor(hour)3   -0.0627050641  0.0109639030  -5.719  0.00000001071887546 ***
as.factor(hour)4   -0.0792165121  0.0110779958  -7.151  0.00000000000086622 ***
as.factor(hour)5   -0.1063125193  0.0111118636  -9.567 < 0.0000000000000002 ***
as.factor(hour)6   -0.1101516353  0.0112038626  -9.832 < 0.0000000000000002 ***
as.factor(hour)7   -0.1195213970  0.0111212930 -10.747 < 0.0000000000000002 ***
as.factor(hour)8   -0.1370602608  0.0112331974 -12.201 < 0.0000000000000002 ***
as.factor(hour)9   -0.1380975420  0.0112700222 -12.254 < 0.0000000000000002 ***
as.factor(hour)10  -0.0946663358  0.0113206161  -8.362 < 0.0000000000000002 ***
as.factor(hour)11  -0.0209810116  0.0113653448  -1.846             0.064886 .  
as.factor(hour)12   0.0730297565  0.0112835813   6.472  0.00000000009683998 ***
as.factor(hour)13   0.0824309008  0.0112945424   7.298  0.00000000000029271 ***
as.factor(hour)14   0.0045436045  0.0112777013   0.403             0.687034    
as.factor(hour)15  -0.0371763757  0.0112323299  -3.310             0.000934 ***
as.factor(hour)16   0.0008977740  0.0111924873   0.080             0.936069    
as.factor(hour)17   0.0459173998  0.0111490886   4.118  0.00003815457815670 ***
as.factor(hour)18   0.0531512282  0.0111126555   4.783  0.00000172890107239 ***
as.factor(hour)19   0.0889839109  0.0110938025   8.021  0.00000000000000106 ***
as.factor(hour)20   0.1446549314  0.0110710214  13.066 < 0.0000000000000002 ***
as.factor(hour)21   0.2617892994  0.0110773281  23.633 < 0.0000000000000002 ***
as.factor(hour)22   0.2129661694  0.0110721199  19.234 < 0.0000000000000002 ***
as.factor(hour)23   0.0693027415  0.0110410602   6.277  0.00000000034640359 ***
dow_simple2         0.0082986117  0.0062284929   1.332             0.182743    
dow_simple3         0.0111717574  0.0059486683   1.878             0.060379 .  
dow_simple4        -0.0231988482  0.0059691769  -3.886             0.000102 ***
dow_simple5        -0.0092218772  0.0063357296  -1.456             0.145523    
dow_simple6        -0.0043506980  0.0063632075  -0.684             0.494148    
dow_simple7         0.0087583914  0.0061534051   1.423             0.154639    
temp                0.0045158277  0.0001816757  24.857 < 0.0000000000000002 ***
precip              0.4648518131  0.2120629022   2.192             0.028377 *  
lag_1hr             0.1698768224  0.0023842045  71.251 < 0.0000000000000002 ***
lag_3hr             0.0877341004  0.0023504847  37.326 < 0.0000000000000002 ***
lag_1day            0.2040647533  0.0023331026  87.465 < 0.0000000000000002 ***
med_inc             0.0000002482  0.0000000583   4.257  0.00002073017679554 ***
pct_public_commute -0.0432988299  0.0205686385  -2.105             0.035285 *  
pct_white           0.1871290672  0.0099504736  18.806 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6591 on 168927 degrees of freedom
Multiple R-squared:   0.19, Adjusted R-squared:  0.1898 
F-statistic:  1071 on 37 and 168927 DF,  p-value: < 0.00000000000000022
Code
# Get metrics.
metrics3_q4 <- get_metrics(model3_q4, "Model 3 2024 Q4")

# Predict test set and calculate MAE.
test_clean$pred3 <- predict(model3_q4, newdata = test_clean)
mae_value <- mean(abs(test_clean$trips - test_clean$pred3), na.rm = TRUE)

# Add MAE to metrics table.
metrics3_q4 <- metrics3_q4 %>% mutate(mae = mae_value)
Code
# Fit model.
model4_q4 <- lm(
  trips ~ as.factor(hour) + dow_simple + temp + precip +
    lag_1hr + lag_3hr + lag_1day +
    med_inc + pct_public_commute + pct_white +
    as.factor(start_station_id),
  data = train_clean
)

# Summary is too long with all station dummies; show key metrics only.
cat("Model 4 R-Squared:", summary(model4_q4)$r.squared, "\n")
Model 4 R-Squared: 0.2179482 
Code
cat("Model 4 Adjusted R-Squared:", summary(model4_q4)$adj.r.squared, "\n")
Model 4 Adjusted R-Squared: 0.2167058 
Code
# Get metrics.
metrics4_q4 <- get_metrics(model4_q4, "Model 4 2024 Q4")

# Predict test set and calculate MAE.
test_clean$pred4 <- predict(model4_q4, newdata = test_clean)
mae_value <- mean(abs(test_clean$trips - test_clean$pred4), na.rm = TRUE)

# Add MAE to metrics table.
metrics4_q4 <- metrics4_q4 %>% mutate(mae = mae_value)
Code
# Fit model.
model5_q4 <- lm(
  trips ~ as.factor(hour) + dow_simple + temp + precip +
    lag_1hr + lag_3hr + lag_1day +
    med_inc + pct_public_commute + pct_white +
    as.factor(start_station_id) +
    rush_hour * is_weekend,
  data = train_clean
)

# Summary is too long with all station dummies; show key metrics only.
cat("Model 5 R-Squared:", summary(model5_q4)$r.squared, "\n")
Model 5 R-Squared: 0.2185756 
Code
cat("Model 5 Adjusted R-Squared:", summary(model5_q4)$adj.r.squared, "\n")
Model 5 Adjusted R-Squared: 0.2173295 
Code
# Get metrics.
metrics5_q4 <- get_metrics(model5_q4, "Model 5 2024 Q4")

# Predict test set and calculate MAE.
test_clean$pred5 <- predict(model5_q4, newdata = test_clean)
mae_value <- mean(abs(test_clean$trips - test_clean$pred5), na.rm = TRUE)

# Add MAE to metrics table.
metrics5_q4 <- metrics5_q4 %>% mutate(mae = mae_value)

ii. 2025 Q1 Models

Code
# Fit model.
model1_q1 <- lm(
  trips ~ as.factor(hour) + dow_simple + temp + precip,
  data = train_q1_clean
)

summary(model1_q1)

Call:
lm(formula = trips ~ as.factor(hour) + dow_simple + temp + precip, 
    data = train_q1_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.4958 -0.2179 -0.1240 -0.0306 13.7295 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)        0.1222234  0.0058034  21.061 < 0.0000000000000002 ***
as.factor(hour)1  -0.0598351  0.0060596  -9.874 < 0.0000000000000002 ***
as.factor(hour)2  -0.1042802  0.0060326 -17.286 < 0.0000000000000002 ***
as.factor(hour)3  -0.1162613  0.0060334 -19.270 < 0.0000000000000002 ***
as.factor(hour)4  -0.1330465  0.0060328 -22.054 < 0.0000000000000002 ***
as.factor(hour)5  -0.1635819  0.0060611 -26.989 < 0.0000000000000002 ***
as.factor(hour)6  -0.1789510  0.0061171 -29.254 < 0.0000000000000002 ***
as.factor(hour)7  -0.1923737  0.0062101 -30.978 < 0.0000000000000002 ***
as.factor(hour)8  -0.2137861  0.0066647 -32.077 < 0.0000000000000002 ***
as.factor(hour)9  -0.2238374  0.0063651 -35.166 < 0.0000000000000002 ***
as.factor(hour)10 -0.2111798  0.0060571 -34.865 < 0.0000000000000002 ***
as.factor(hour)11 -0.1460043  0.0060774 -24.024 < 0.0000000000000002 ***
as.factor(hour)12 -0.1024646  0.0060874 -16.832 < 0.0000000000000002 ***
as.factor(hour)13  0.0402710  0.0060996   6.602     0.00000000004057 ***
as.factor(hour)14 -0.0157936  0.0061074  -2.586             0.009710 ** 
as.factor(hour)15 -0.0809547  0.0061033 -13.264 < 0.0000000000000002 ***
as.factor(hour)16 -0.0855351  0.0060925 -14.039 < 0.0000000000000002 ***
as.factor(hour)17 -0.0355088  0.0060736  -5.846     0.00000000502728 ***
as.factor(hour)18 -0.0245748  0.0060567  -4.057     0.00004962423660 ***
as.factor(hour)19 -0.0202258  0.0060495  -3.343             0.000828 ***
as.factor(hour)20  0.0217290  0.0060411   3.597             0.000322 ***
as.factor(hour)21  0.0647025  0.0060365  10.719 < 0.0000000000000002 ***
as.factor(hour)22  0.1914266  0.0060349  31.720 < 0.0000000000000002 ***
as.factor(hour)23  0.0877243  0.0060346  14.537 < 0.0000000000000002 ***
dow_simple2        0.0231868  0.0033073   7.011     0.00000000000237 ***
dow_simple3        0.0156091  0.0033233   4.697     0.00000264199760 ***
dow_simple4        0.0043930  0.0033097   1.327             0.184400    
dow_simple5        0.0205737  0.0032843   6.264     0.00000000037525 ***
dow_simple6       -0.0285507  0.0032727  -8.724 < 0.0000000000000002 ***
dow_simple7       -0.0478633  0.0032768 -14.607 < 0.0000000000000002 ***
temp               0.0033126  0.0001051  31.522 < 0.0000000000000002 ***
precip            -1.4246730  0.1264642 -11.265 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.487 on 306643 degrees of freedom
Multiple R-squared:  0.04998,   Adjusted R-squared:  0.04988 
F-statistic: 520.4 on 31 and 306643 DF,  p-value: < 0.00000000000000022
Code
# Get metrics.
metrics1_q1 <- get_metrics(model1_q1, "Model 1 2025 Q1")

# Predict test set and calculate MAE.
test_q1_clean$pred1 <- predict(model1_q1, newdata = test_q1_clean)
mae_value <- mean(abs(test_q1_clean$trips - test_q1_clean$pred1), na.rm = TRUE)

# Add MAE to metrics table.
metrics1_q1 <- metrics1_q1 %>% mutate(mae = mae_value)
Code
# Fit model.
model2_q1 <- lm(
  trips ~ as.factor(hour) + dow_simple + temp + precip +
    lag_1hr + lag_3hr + lag_1day,
  data = train_q1_clean
)

summary(model2_q1)

Call:
lm(formula = trips ~ as.factor(hour) + dow_simple + temp + precip + 
    lag_1hr + lag_3hr + lag_1day, data = train_q1_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0806 -0.1845 -0.0892 -0.0170 13.3446 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)        0.0568434  0.0056292  10.098 < 0.0000000000000002 ***
as.factor(hour)1  -0.0464379  0.0058652  -7.917  0.00000000000000243 ***
as.factor(hour)2  -0.0654205  0.0058400 -11.202 < 0.0000000000000002 ***
as.factor(hour)3  -0.0598442  0.0058453 -10.238 < 0.0000000000000002 ***
as.factor(hour)4  -0.0669974  0.0058489 -11.455 < 0.0000000000000002 ***
as.factor(hour)5  -0.0875594  0.0058820 -14.886 < 0.0000000000000002 ***
as.factor(hour)6  -0.0937213  0.0059417 -15.774 < 0.0000000000000002 ***
as.factor(hour)7  -0.1011217  0.0060353 -16.755 < 0.0000000000000002 ***
as.factor(hour)8  -0.1153840  0.0064777 -17.812 < 0.0000000000000002 ***
as.factor(hour)9  -0.1179709  0.0061960 -19.040 < 0.0000000000000002 ***
as.factor(hour)10 -0.1028264  0.0059032 -17.419 < 0.0000000000000002 ***
as.factor(hour)11 -0.0429344  0.0059204  -7.252  0.00000000000041189 ***
as.factor(hour)12 -0.0159560  0.0059209  -2.695              0.00704 ** 
as.factor(hour)13  0.1034144  0.0059234  17.459 < 0.0000000000000002 ***
as.factor(hour)14  0.0130688  0.0059166   2.209              0.02719 *  
as.factor(hour)15 -0.0426636  0.0059090  -7.220  0.00000000000052068 ***
as.factor(hour)16 -0.0428210  0.0058978  -7.261  0.00000000000038649 ***
as.factor(hour)17  0.0100745  0.0058797   1.713              0.08663 .  
as.factor(hour)18  0.0139539  0.0058622   2.380              0.01730 *  
as.factor(hour)19  0.0167151  0.0058550   2.855              0.00431 ** 
as.factor(hour)20  0.0519245  0.0058440   8.885 < 0.0000000000000002 ***
as.factor(hour)21  0.0820044  0.0058376  14.048 < 0.0000000000000002 ***
as.factor(hour)22  0.1917855  0.0058368  32.858 < 0.0000000000000002 ***
as.factor(hour)23  0.0641552  0.0058381  10.989 < 0.0000000000000002 ***
dow_simple2        0.0097238  0.0031984   3.040              0.00236 ** 
dow_simple3       -0.0007425  0.0032153  -0.231              0.81737    
dow_simple4       -0.0091244  0.0032017  -2.850              0.00437 ** 
dow_simple5        0.0062319  0.0031765   1.962              0.04978 *  
dow_simple6       -0.0369855  0.0031670 -11.679 < 0.0000000000000002 ***
dow_simple7       -0.0415423  0.0031691 -13.108 < 0.0000000000000002 ***
temp               0.0021586  0.0001019  21.183 < 0.0000000000000002 ***
precip            -1.4322051  0.1222831 -11.712 < 0.0000000000000002 ***
lag_1hr            0.1571103  0.0017899  87.774 < 0.0000000000000002 ***
lag_3hr            0.0830757  0.0017809  46.649 < 0.0000000000000002 ***
lag_1day           0.1401371  0.0017847  78.522 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4708 on 306640 degrees of freedom
Multiple R-squared:  0.1123,    Adjusted R-squared:  0.1122 
F-statistic:  1141 on 34 and 306640 DF,  p-value: < 0.00000000000000022
Code
# Get metrics.
metrics2_q1 <- get_metrics(model2_q1, "Model 2 2025 Q1")

# Predict test set and calculate MAE.
test_q1_clean$pred2 <- predict(model2_q1, newdata = test_q1_clean)
mae_value <- mean(abs(test_q1_clean$trips - test_q1_clean$pred2), na.rm = TRUE)

# Add MAE to metrics table.
metrics2_q1 <- metrics2_q1 %>% mutate(mae = mae_value)
Code
# Fit model.
model3_q1 <- lm(
  trips ~ as.factor(hour) + dow_simple + temp + precip +
    lag_1hr + lag_3hr + lag_1day +
    med_inc + pct_public_commute + pct_white,
  data = train_q1_clean
)

summary(model3_q1)

Call:
lm(formula = trips ~ as.factor(hour) + dow_simple + temp + precip + 
    lag_1hr + lag_3hr + lag_1day + med_inc + pct_public_commute + 
    pct_white, data = train_q1_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9933 -0.1929 -0.0894 -0.0065 13.3407 

Coefficients:
                         Estimate     Std. Error t value             Pr(>|t|)
(Intercept)        -0.02781053027  0.00664762707  -4.184  0.00002871001436634
as.factor(hour)1   -0.04668997775  0.00584373374  -7.990  0.00000000000000136
as.factor(hour)2   -0.06750909317  0.00581874808 -11.602 < 0.0000000000000002
as.factor(hour)3   -0.06323653164  0.00582433137 -10.857 < 0.0000000000000002
as.factor(hour)4   -0.07117085340  0.00582810575 -12.212 < 0.0000000000000002
as.factor(hour)5   -0.09247095018  0.00586138867 -15.776 < 0.0000000000000002
as.factor(hour)6   -0.09918997905  0.00592099473 -16.752 < 0.0000000000000002
as.factor(hour)7   -0.10700290035  0.00601443519 -17.791 < 0.0000000000000002
as.factor(hour)8   -0.12176892528  0.00645538012 -18.863 < 0.0000000000000002
as.factor(hour)9   -0.12488070739  0.00617501224 -20.224 < 0.0000000000000002
as.factor(hour)10  -0.10994289841  0.00588351761 -18.687 < 0.0000000000000002
as.factor(hour)11  -0.04983762170  0.00590051241  -8.446 < 0.0000000000000002
as.factor(hour)12  -0.02197384413  0.00590053299  -3.724             0.000196
as.factor(hour)13   0.09878530765  0.00590249324  16.736 < 0.0000000000000002
as.factor(hour)14   0.01056317646  0.00589517033   1.792             0.073160
as.factor(hour)15  -0.04547825905  0.00588764829  -7.724  0.00000000000001128
as.factor(hour)16  -0.04527061896  0.00587639671  -7.704  0.00000000000001325
as.factor(hour)17   0.00726910855  0.00585846143   1.241             0.214685
as.factor(hour)18   0.01127814588  0.00584096977   1.931             0.053500
as.factor(hour)19   0.01410787396  0.00583381860   2.418             0.015594
as.factor(hour)20   0.04989843752  0.00582270902   8.570 < 0.0000000000000002
as.factor(hour)21   0.08072614910  0.00581627986  13.879 < 0.0000000000000002
as.factor(hour)22   0.19147128023  0.00581544456  32.925 < 0.0000000000000002
as.factor(hour)23   0.06527930706  0.00581677190  11.223 < 0.0000000000000002
dow_simple2         0.01062542339  0.00318672268   3.334             0.000855
dow_simple3         0.00033522053  0.00320358762   0.105             0.916662
dow_simple4        -0.00822933033  0.00319006246  -2.580             0.009890
dow_simple5         0.00719005750  0.00316491157   2.272             0.023099
dow_simple6        -0.03643393359  0.00315536796 -11.547 < 0.0000000000000002
dow_simple7        -0.04198724753  0.00315752752 -13.298 < 0.0000000000000002
temp                0.00223495042  0.00010154217  22.010 < 0.0000000000000002
precip             -1.43214299982  0.12183500303 -11.755 < 0.0000000000000002
lag_1hr             0.14886988741  0.00179177742  83.085 < 0.0000000000000002
lag_3hr             0.07438060713  0.00178373229  41.699 < 0.0000000000000002
lag_1day            0.13187012801  0.00178662788  73.810 < 0.0000000000000002
med_inc             0.00000018371  0.00000003079   5.967  0.00000000242500380
pct_public_commute -0.00527004470  0.01086403632  -0.485             0.627612
pct_white           0.13995030019  0.00524992297  26.658 < 0.0000000000000002
                      
(Intercept)        ***
as.factor(hour)1   ***
as.factor(hour)2   ***
as.factor(hour)3   ***
as.factor(hour)4   ***
as.factor(hour)5   ***
as.factor(hour)6   ***
as.factor(hour)7   ***
as.factor(hour)8   ***
as.factor(hour)9   ***
as.factor(hour)10  ***
as.factor(hour)11  ***
as.factor(hour)12  ***
as.factor(hour)13  ***
as.factor(hour)14  .  
as.factor(hour)15  ***
as.factor(hour)16  ***
as.factor(hour)17     
as.factor(hour)18  .  
as.factor(hour)19  *  
as.factor(hour)20  ***
as.factor(hour)21  ***
as.factor(hour)22  ***
as.factor(hour)23  ***
dow_simple2        ***
dow_simple3           
dow_simple4        ** 
dow_simple5        *  
dow_simple6        ***
dow_simple7        ***
temp               ***
precip             ***
lag_1hr            ***
lag_3hr            ***
lag_1day           ***
med_inc            ***
pct_public_commute    
pct_white          ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4691 on 306637 degrees of freedom
Multiple R-squared:  0.1188,    Adjusted R-squared:  0.1187 
F-statistic:  1118 on 37 and 306637 DF,  p-value: < 0.00000000000000022
Code
# Get metrics.
metrics3_q1 <- get_metrics(model3_q1, "Model 3 2025 Q1")

# Predict test set and calculate MAE.
test_q1_clean$pred3 <- predict(model3_q1, newdata = test_q1_clean)
mae_value <- mean(abs(test_q1_clean$trips - test_q1_clean$pred3), na.rm = TRUE)

# Add MAE to metrics table.
metrics3_q1 <- metrics3_q1 %>% mutate(mae = mae_value)
Code
# Fit model.
model4_q1 <- lm(
  trips ~ as.factor(hour) + dow_simple + temp + precip +
    lag_1hr + lag_3hr + lag_1day +
    med_inc + pct_public_commute + pct_white +
    as.factor(start_station_id),
  data = train_q1_clean
)

# Summary is too long with all station dummies; show key metrics only.
cat("Model 4 R-Squared:", summary(model4_q1)$r.squared, "\n")
Model 4 R-Squared: 0.1511393 
Code
cat("Model 4 Adjusted R-Squared:", summary(model4_q1)$adj.r.squared, "\n")
Model 4 Adjusted R-Squared: 0.1503969 
Code
# Get metrics.
metrics4_q1 <- get_metrics(model4_q1, "Model 4 2025 Q1")

# Predict test set and calculate MAE.
test_q1_clean$pred4 <- predict(model4_q1, newdata = test_q1_clean)
mae_value <- mean(abs(test_q1_clean$trips - test_q1_clean$pred4), na.rm = TRUE)

# Add MAE to metrics table.
metrics4_q1 <- metrics4_q1 %>% mutate(mae = mae_value)
Code
# Fit model.
model5_q1 <- lm(
  trips ~ as.factor(hour) + dow_simple + temp + precip +
    lag_1hr + lag_3hr + lag_1day +
    med_inc + pct_public_commute + pct_white +
    as.factor(start_station_id) +
    rush_hour * is_weekend,
  data = train_q1_clean
)

# Summary is too long with all station dummies; show key metrics only.
cat("Model 5 R-Squared:", summary(model5_q1)$r.squared, "\n")
Model 5 R-Squared: 0.1521488 
Code
cat("Model 5 Adjusted R-Squared:", summary(model5_q1)$adj.r.squared, "\n")
Model 5 Adjusted R-Squared: 0.1514044 
Code
# Get metrics.
metrics5_q1 <- get_metrics(model5_q1, "Model 5 2025 Q1")

# Predict test set and calculate MAE.
test_q1_clean$pred5 <- predict(model5_q1, newdata = test_q1_clean)
mae_value <- mean(abs(test_q1_clean$trips - test_q1_clean$pred5), na.rm = TRUE)

# Add MAE to metrics table.
metrics5_q1 <- metrics5_q1 %>% mutate(mae = mae_value)
Code
# Combine all model metrics into single table.
final_metrics <- bind_rows(
  metrics1_q4, metrics2_q4, metrics3_q4, metrics4_q4, metrics5_q4,
  metrics1_q1, metrics2_q1, metrics3_q1, metrics4_q1, metrics5_q1
)

# Sort by MAE ascending.
final_metrics <- final_metrics %>%
  arrange(mae)

# Create formatted table.
table_final <- final_metrics %>%
  mutate(
    # Round metrics for display.
    adjusted_r_squared = round(adjusted_r_squared, 4),
    aic = scales::comma(round(aic, 0)),
    mae = round(mae, 4)
  ) %>%
  select(model, mae, adjusted_r_squared, aic) %>%
  kable(
    caption = "Model Performance Comparison (2024 Q4 vs. 2025 Q1)",
    col.names = c("Model and Features", "MAE (Trips / 30-Min)", "Adjusted R²", "AIC"),
    align = c("l", "c", "c", "c")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"), 
    full_width = FALSE
  ) %>%
  footnote(
    general = "Average predicted trip count error per station per 30-minute interval.\nSorted ascending.",
    general_title = "MAE (Mean Absolute Error): Lower is better.",
    footnote_as_chunk = FALSE
  )

table_final
Model Performance Comparison (2024 Q4 vs. 2025 Q1)
Model and Features MAE (Trips / 30-Min) Adjusted R² AIC
Model 2 2024 Q4 0.2508 0.1841 339,853
Model 3 2024 Q4 0.2574 0.1898 338,671
Model 4 2024 Q4 0.2770 0.2167 333,199
Model 5 2024 Q4 0.2773 0.2173 333,065
Model 1 2024 Q4 0.2985 0.0742 361,196
Model 4 2025 Q1 0.3378 0.1504 395,022
Model 5 2025 Q1 0.3382 0.1514 394,659
Model 3 2025 Q1 0.3436 0.1187 406,017
Model 2 2025 Q1 0.3443 0.1122 408,265
Model 1 2025 Q1 0.3677 0.0499 429,076
MAE (Mean Absolute Error): Lower is better.
Average predicted trip count error per station per 30-minute interval.
Sorted ascending.
Code
# Prepare data for MAE comparison plot.
mae_plot_data <- final_metrics %>%
  mutate(
    # Create readable model labels for x-axis.
    model_label = case_when(
      grepl("Model 1", model) ~ "1. Time / Weather",
      grepl("Model 2", model) ~ "2. + Lags",
      grepl("Model 3", model) ~ "3. + Demographics",
      grepl("Model 4", model) ~ "4. + Station FE",
      grepl("Model 5", model) ~ "5. + Interaction"
    ),
    # Extract quarter for faceting.
    forecast_type = ifelse(
      grepl("2024 Q4", model),
      "2024 Q4",
      "2025 Q1"
    )
  )

# Create bar plot comparing MAE across models.
mae_comp_plot <- ggplot(mae_plot_data, aes(
  # Order bars by MAE value.
  x = reorder(model_label, -mae),
  y = mae,
  fill = forecast_type
)) +
  # Create bars with border.
  geom_col(alpha = 1,
    colour = "#3d3b3c",
    linewidth = 1) +
  # Add value labels on top of bars.
  geom_text(
    aes(label = round(mae, 3)),
    vjust = -0.3,
    size = 6,
    color = "#2d2a26",
    family = "anonymous"
  ) +
  # Separate facets for Q4 and Q1.
  facet_wrap(~ forecast_type, scales = "free_x", ncol = 2) +
  # Define colors for quarters.
  scale_fill_manual(values = c(
    "2024 Q4" = "#f6a2a7",
    "2025 Q1" = "#8bcef2"
  )) +
  # Format y-axis.
  scale_y_continuous(labels = scales::number_format(accuracy = 0.001)) +
  labs(
    title = "Model Prediction Accuracy Comparison",
    subtitle = "Mean Absolute Error (MAE): 2024 Q4 vs. 2025 Q1 models",
    x = "Model Complexity",
    y = "Mean Absolute Error (Trips per 30-Min)",
    fill = NULL
  ) +
  theme_plot() +
  theme(
    # Angle x-axis labels.
    axis.text.x = element_text(angle = 45, hjust = 1, size = 20),
    # Hide legend since colors labeled in facets.
    legend.position = "none",
    # Format facet labels.
    strip.text = element_text(
      face = "bold",
      size = 20,
      family = "outfit",
      color = "#2d2a26"
    )
  )

mae_comp_plot

From the model performance comparison table and bar chart, the 2024 Q4 models clearly outperform their 2025 Q1 counterparts on every metric, with Model 2 posting the lowest MAE for Q4 and Models 4 and 5 the lowest for Q1. In both quarters, MAE drops sharply and adjusted R-squared jumps after the first model, once the lag features enter.

Only the Q4 models, however, show a steady decline in MAE as complexity grows; the Q1 curve is comparatively flat. Note that Q4's training set is roughly half the size of Q1's, which likely contributes to the gap. Fewer Indego trips are expected in Q4 heading into the colder winter months, whereas Q1 climbs out of the cold toward spring (see the earlier temperature plot), so a seasonal difference is clearly at work.

Also, from the OLS printouts, precipitation carries the largest-magnitude coefficient among the weather variables and is statistically significant: about -1.43 in the Q1 models, meaning rain sharply suppresses ridership (interestingly, the Q4 estimate is positive, at roughly 0.46). The strong negative Q1 effect makes sense because precipitation often brings other hazards with it depending on the region; in Philadelphia's case that can mean thunderstorms, flood risk in certain neighborhoods, and ice build-up that makes roads and sidewalks more dangerous in the colder months.
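Coefficient comparisons like this are less error-prone when pulled programmatically rather than read off printouts. A minimal sketch on simulated data (the toy data frame below is illustrative, not the Indego panel); the same `coef()` / `summary()` calls would apply to the fitted `model2_q4` and `model2_q1` objects:

```r
# Toy stand-in for the station-interval panel (assumption: illustrative
# only -- not the real Indego training data).
set.seed(1)
toy <- data.frame(
  trips  = rpois(200, 2),
  temp   = runif(200, 30, 60),
  precip = rbinom(200, 1, 0.2) * runif(200, 0, 0.3)
)

fit <- lm(trips ~ temp + precip, data = toy)

# coef() returns the named estimate vector; "precip" indexes the slope.
precip_beta <- coef(fit)[["precip"]]
# The full coefficient row: estimate, std. error, t value, p-value.
precip_row <- summary(fit)$coefficients["precip", ]
```

Applied to both quarterly models, this makes the sign flip and magnitude difference explicit in one line per model.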

PART II: ERROR ANALYSIS

1. Spatial Patterns

Code
# Prepare test set for predictions.
# Create factor variable with proper contrasts.
test_clean <- test_clean %>%
  mutate(dow_simple = factor(dow, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

# Set treatment contrasts for factor.
contrasts(test_clean$dow_simple) <- contr.treatment(7)

# Generate predictions using Model 2.
test_clean <- test_clean %>%
  mutate(
    model2_q4_pred = predict(model2_q4, newdata = test_clean)
  )

# Calculate error metrics.
test_q4_error <- test_clean %>%
  mutate(
    # Raw error (actual - predicted).
    error = trips - model2_q4_pred,
    # Absolute error for MAE calculation.
    abs_error = abs(error),
    # Categorize hour into time periods.
    time_of_day = case_when(
      hour < 7 ~ "Overnight", # 00:00 to 06:59
      hour >= 7 & hour < 10 ~ "AM Rush", # 07:00 to 09:59
      hour >= 10 & hour < 15 ~ "Mid-Day", # 10:00 to 14:59
      hour >= 15 & hour <= 18 ~ "PM Rush", # 15:00 to 18:59
      hour > 18 ~ "Evening" # 19:00 to 23:59
    )
  )

# Aggregate errors to station level.
station_errors_q4 <- test_q4_error %>%
  group_by(start_station_id, start_lat, start_lon) %>%
  summarize(
    MAE = mean(abs_error, na.rm = TRUE),
    # Average demand per station.
    avg_demand = mean(trips, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  # Remove stations with missing coordinates.
  filter(!is.na(start_lat), !is.na(start_lon))

# Map of prediction errors.
pred_error_q4_map <- ggplot() +
  # Census tracts as background.
  geom_sf(data = philly_census, fill = "#2d2a26", color = "#ff4100", linewidth = 0.35) +
  # Stations colored by MAE.
  geom_point(
    data = station_errors_q4,
    aes(x = start_lon, y = start_lat, color = MAE),
    size = 2,
    alpha = 0.75
  ) +
  scale_color_viridis(
    option = "mako",
    name = "MAE (Trips)",
    direction = -1
  ) +
  labs(
    title = "Model Prediction Errors",
    subtitle = "MAE per Station (Q4 2024 Test Set)"
  ) +
  theme_map()

# Create map of average demand.
avg_dem_q4_map <- ggplot() +
  geom_sf(data = philly_census, fill = "#2d2a26", color = "#ff4100", linewidth = 0.35) +
  geom_point(
    data = station_errors_q4,
    aes(x = start_lon, y = start_lat, color = avg_demand),
    size = 2,
    alpha = 0.75
  ) +
  scale_color_viridis(
    option = "mako",
    name = "Average Demand (Trips per 30-Min)",
    direction = -1
  ) +
  labs(
    title = "Average Station Demand",
    subtitle = "Trips per Station-30-Min (2024 Q4 Test Set)"
  ) +
  theme_map()

# Display maps side by side.
pred_error_q4_map | avg_dem_q4_map

The maps make clear that the highest MAE and the largest demand both concentrate in Philadelphia's urban core, with the gradient fading outward in a ring: lower MAE and lower demand in the neighborhoods on Center City's periphery. The densest cluster of high MAE and high demand actually sits slightly south of Market Street, around Washington Square.

This likely reflects different demand patterns in the business district versus the residential areas outside the city center's bustle. MAE and demand are intertwined: more trip volume brings more complex, harder-to-predict human behavior and bike use.
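There is also a mechanical reason the two maps track each other: for count data, absolute error grows with the mean, so busy stations post higher MAE even under a well-calibrated forecast. A hedged sketch on simulated Poisson counts (the rates are arbitrary assumptions):

```r
set.seed(42)
# Quiet peripheral station vs. busy core station (arbitrary rates).
low  <- rpois(10000, lambda = 0.2)
high <- rpois(10000, lambda = 5)

# Forecast each station's true mean -- the best possible constant guess.
mae_low  <- mean(abs(low  - 0.2))
mae_high <- mean(abs(high - 5))
```

Even with the true mean known, the busy station's MAE comes out several times larger, so a raw MAE map partly restates the demand map; a scaled metric (e.g., MAE divided by average demand) would separate model error from sheer volume.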

2. Temporal Patterns

Code
# Create scatter plot of observed vs predicted.
obs_vs_pred_plot <- ggplot(test_q4_error, aes(x = trips, y = model2_q4_pred)) +
  # Plot points with transparency to show density.
  geom_point(alpha = 0.25, color = "#1492D3") +
  # Add 45-degree line perfect predictions.
  geom_abline(slope = 1, intercept = 0, color = "#F03E36", linewidth = 1, linetype = "dashed") +
  # Add OLS fit line to show systematic bias.
  geom_smooth(method = "lm", se = FALSE, color = "#00a557", linetype = "dashed") +
  # Facet by weekend and time of day.
  facet_grid(is_weekend ~ time_of_day) +
  labs(
    title = "Observed vs. Predicted Bike Trips (2024 Q4 Model 2)",
    subtitle = "Red Line = Perfect Predictions\nGreen Line = OLS Fit",
    caption = "Performance grouped by time of day and weekday / weekend.",
    x = "Observed Trips (Trips per 30-Min)",
    y = "Predicted Trips (Model 2)"
  ) +
  theme_plot() +
  theme(
    axis.text.x = element_text(hjust = 1, size = 20),
    legend.position = "none",
    strip.text = element_text(
      face = "bold",
      size = 20,
      family = "anonymous",
      color = "#2d2a26"
    )
  )

obs_vs_pred_plot

Code
# Calculate MAE by time period and day type.
temporal_errors_q4 <- test_q4_error %>%
  group_by(time_of_day, is_weekend) %>%
  summarize(
    MAE = mean(abs_error, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  # Day type label.
  mutate(day_type = ifelse(is_weekend, "Weekend", "Weekday"))

# Bar plot of temporal errors.
temp_error_plot <- ggplot(temporal_errors_q4, aes(x = time_of_day, y = MAE, fill = day_type)) +
  # Dodged bars to compare weekday vs weekend.
  geom_col(position = "dodge",
    colour = "#3d3b3c",
    linewidth = 1) +
  # Add value labels above each dodged bar.
  geom_text(
    aes(label = round(MAE, 3)),
    position = position_dodge(width = 0.9),
    vjust = -0.3,
    size = 6,
    color = "#2d2a26",
    family = "anonymous"
  ) +
  # Colors for day types.
  scale_fill_manual(values = c("Weekday" = "#f6a2a7", "Weekend" = "#8bcef2")) +
  labs(
    title = "Prediction Errors by Time Period",
    subtitle = "2024 Q4",
    caption = "MAE is highest during commuter peaks.",
    x = "Time of Day",
    y = "Mean Absolute Error (Trips)",
    fill = "Day Type"
  ) +
  theme_plot() +
  # Angle labels.
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

temp_error_plot

As with Q1, the Q4 model struggles most with commuter peaks: the evening, mid-day, and PM rush periods that capture office workers and shift workers. Overnight MAE is low largely because a strong demand signal is absent; the zero-inflated counts make it easy to predict. The AM rush similarly has the smallest MAE, where the 1-day lag likely helps, since morning schedules are the most rigid for office workers.

The model is least accurate in the evening, when behavior and demand are both varied and high. That is unsurprising: this slot captures trips outside work schedules, unlike the PM rush, which posts a lower MAE by comparison.
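The zero-inflation point can be shown directly. A minimal simulated sketch (the 95% zero share and the rates are assumptions, not estimates from the data):

```r
set.seed(7)
# Overnight: ~95% of intervals see zero trips (assumed zero share).
overnight <- ifelse(runif(10000) < 0.95, 0, rpois(10000, 1) + 1)
# Evening: modest but steady demand (assumed rate).
evening <- rpois(10000, lambda = 1.5)

# MAE of the best constant forecast (the series mean) for each period.
mae_overnight <- mean(abs(overnight - mean(overnight)))
mae_evening   <- mean(abs(evening - mean(evening)))
```

Predicting "almost nothing" overnight is nearly always right, so the low overnight MAE reflects an easy target rather than a better-understood one.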

3. Demographic Patterns

Code
# Join demographic data to station errors.
station_errors_demo_q4 <- station_errors_q4 %>%
  left_join(
    station_census_lookup %>%
      select(start_station_id, med_inc, pct_public_commute, pct_white),
    by = "start_station_id"
  ) %>%
  # Keep only stations with census data.
  filter(!is.na(med_inc))

# Scatter plot of MAE vs median income.
mae_inc_q4_map <- ggplot(station_errors_demo_q4, aes(x = med_inc, y = MAE)) +
  # Plot points.
  geom_point(alpha = 0.5, color = "#011f5b") +
  # Add linear trend line.
  geom_smooth(method = "lm", se = FALSE, color = "#990000", linetype = "dashed") +
  # Format x-axis as dollar amounts.
  scale_x_continuous(labels = scales::dollar) +
  labs(
    title = "Prediction Errors vs. Median Income",
    subtitle = "2024 Q4",
    x = "Median Income",
    y = "MAE"
  ) +
  theme_plot()

# Scatter plot of MAE vs transit usage.
mae_pub_q4_map <- ggplot(station_errors_demo_q4, aes(x = pct_public_commute, y = MAE)) +
  geom_point(alpha = 0.5, color = "#011f5b") +
  geom_smooth(method = "lm", se = FALSE, color = "#990000", linetype = "dashed") +
  labs(
    title = "Prediction Errors vs. Transit Usage",
    subtitle = "2024 Q4",
    x = "% Taking Transit",
    y = "MAE"
  ) +
  theme_plot()

# Scatter plot of MAE vs race.
mae_yt_q4_demo <- ggplot(station_errors_demo_q4, aes(x = pct_white, y = MAE)) +
  geom_point(alpha = 0.5, color = "#011f5b") +
  geom_smooth(method = "lm", se = FALSE, color = "#990000", linetype = "dashed") +
  labs(
    title = "Prediction Errors vs. Percent White",
    subtitle = "2024 Q4",
    x = "% White",
    y = "MAE"
  ) +
  theme_plot()

# Stack all demographic plots vertically.
mae_inc_q4_map / mae_pub_q4_map / mae_yt_q4_demo

It looks like the model’s MAE increases as median income increases, and by extension as the percentage of white population increases. On the other hand, MAE decreases as the percentage of public transit commuters increases.

From this, the model struggles in higher-income areas with larger white populations. That could be because these populations rely less on public transit, but it is more likely that the correlation is masking something else.

Referring back to the maps of MAE density and demand density, the business district stood out, and the percent-white and median-income variables are likely acting as proxies for the fact that Center City, while residential, is also the core business district.

Unfortunately, this has equity implications if the model were deployed: it would prioritize service in areas that may not need it. Center City already has dense bike stations and a variety of public transit services, so following the model could reinforce amenity disparities outside Center City.
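The proxy story can be checked with a small omitted-variable simulation (all names and magnitudes below are hypothetical): when an unmodeled driver of error, such as business-district activity, correlates with income, the income slope absorbs its effect, and controlling for the true driver shrinks it.

```r
set.seed(3)
n        <- 500
# Unmodeled business-district flag (hypothetical variable).
biz_core <- rbinom(n, 1, 0.3)
# Income correlated with the district; MAE driven by the district.
med_inc  <- 50000 + 30000 * biz_core + rnorm(n, 0, 10000)
mae      <- 0.20 + 0.15 * biz_core + rnorm(n, 0, 0.05)

# Income alone picks up the district effect it proxies for.
naive <- coef(lm(mae ~ med_inc))[["med_inc"]]
# With the district flag controlled, the income slope collapses.
ctrl  <- coef(lm(mae ~ med_inc + biz_core))[["med_inc"]]
```

In the real panel, a land-use or employment-density covariate would play the role of `biz_core`.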

PART III: MODEL IMPROVEMENT

Code
# Define reference date for temporal trend.
start_of_q4 <- as.Date("2024-10-01")

# Add new features to training set.
train_clean <- train_clean %>%
  mutate(
    # Calculate days since Q4 started.
    # Captures trend within quarter.
    days_since_oct1 = as.numeric(date - start_of_q4),
    # Flag very cold temperatures.
    # Below 37°F may deter bikers.
    is_too_cold = ifelse(temp < 37, 1, 0), 
    # Flag late evening.
    is_evening = ifelse(hour %in% 19:23, 1, 0),
    # Flag mid-day.
    is_mid_day = ifelse(hour %in% 10:15, 1, 0),
    # Flag morning.
    is_morning = ifelse(hour %in% 7:9, 1, 0)
    )

# Add same features to test set.
test_clean <- test_clean %>%
  mutate(
    days_since_oct1 = as.numeric(date - start_of_q4),
    is_too_cold = ifelse(temp < 37, 1, 0),
    is_evening = ifelse(hour %in% 19:23, 1, 0),
    is_mid_day = ifelse(hour %in% 10:15, 1, 0),
    is_morning = ifelse(hour %in% 7:9, 1, 0)
    )

# Add non-white percent.
train_clean$pct_poc <- 1 - train_clean$pct_white
test_clean$pct_poc <- 1 - test_clean$pct_white
Code
# Fit updated model with new features.
update_model <- lm(
  trips ~ as.factor(hour) + dow_simple + temp + precip +
    # Add temporal trend variable.
    days_since_oct1 + is_too_cold +
    # Add time slot markers.
    is_evening + is_mid_day + is_morning +
    lag_1hr + lag_3hr + lag_1day +
    # Add interaction between cold and time.
    # Cold might matter more as winter progresses.
    (is_too_cold * days_since_oct1) +
    # Add interaction between cold and morning, evening, mid-day.
    (is_too_cold * is_evening) +
    (is_too_cold * is_mid_day) +
    (is_too_cold * is_morning) +
    # Individuals may be more likely to take evening transit.
    (pct_public_commute : is_evening) +
    # Lag effects (demand persistence) may differ with neighborhood racial composition.
    (lag_30min : pct_poc) + (lag_1hr : pct_poc) + (lag_1hr30min : pct_poc) +
    (lag_2hr : pct_poc) + (lag_2hr30min : pct_poc) +
    (lag_3hr : pct_poc) + (lag_1day : pct_poc),
  data = train_clean
  )

# Predictions on test set.
test_clean$pred_update <- predict(update_model, newdata = test_clean)
# Calculate MAE for updated model.
mae_update <- mean(abs(test_clean$trips - test_clean$pred_update), na.rm = TRUE)
# Get original Model 2 MAE for comparison.
mae_model2 <- mean(abs(test_clean$trips - test_clean$pred2), na.rm = TRUE)

cat("Model 2 MAE:", round(mae_model2, 4), "\n")
Model 2 MAE: 0.2508 
Code
cat("Updated Model 2 MAE:", round(mae_update, 4), "\n")
Updated Model 2 MAE: 0.227 
Code
cat("Improvement:", round(mae_model2 - mae_update, 4), "trips per hour\n")
Improvement: 0.0239 trips per hour
Code
cat("Percent Improvement:", round(100 * (mae_model2 - mae_update) / mae_model2, 2), "%\n")
Percent Improvement: 9.51 %
Code
# Poisson regression is designed for non-negative integer counts.
# Note it assumes variance equals the mean, so overdispersion should be checked.
poisson_model <- glm(
  trips ~ as.factor(hour) + dow_simple + temp + precip +
    # Add temporal trend variable.
    days_since_oct1 + is_too_cold +
    # Add time slot markers.
    is_evening + is_mid_day + is_morning +
    lag_1hr + lag_3hr + lag_1day +
    # Add interaction between cold and time.
    # Cold might matter more as winter progresses.
    (is_too_cold * days_since_oct1) +
    # Add interaction between cold and morning, evening, mid-day.
    (is_too_cold * is_evening) +
    (is_too_cold * is_mid_day) +
    (is_too_cold * is_morning) +
    # Individuals may be more likely to take evening transit.
    (pct_public_commute : is_evening) +
    # Hypothesis: lagged-demand effects may shift later in tracts with larger non-white populations.
    (lag_30min : pct_poc) + (lag_1hr : pct_poc) + (lag_1hr30min : pct_poc) +
    (lag_2hr : pct_poc) + (lag_2hr30min : pct_poc) +
    (lag_3hr : pct_poc) + (lag_1day : pct_poc),
  data = train_clean,
  # Use log link for Poisson family.
  family = poisson(link = "log")
)

summary(poisson_model)

Call:
glm(formula = trips ~ as.factor(hour) + dow_simple + temp + precip + 
    days_since_oct1 + is_too_cold + is_evening + is_mid_day + 
    is_morning + lag_1hr + lag_3hr + lag_1day + (is_too_cold * 
    days_since_oct1) + (is_too_cold * is_evening) + (is_too_cold * 
    is_mid_day) + (is_too_cold * is_morning) + (pct_public_commute:is_evening) + 
    (lag_30min:pct_poc) + (lag_1hr:pct_poc) + (lag_1hr30min:pct_poc) + 
    (lag_2hr:pct_poc) + (lag_2hr30min:pct_poc) + (lag_3hr:pct_poc) + 
    (lag_1day:pct_poc), family = poisson(link = "log"), data = train_clean)

Coefficients: (3 not defined because of singularities)
                                Estimate Std. Error z value
(Intercept)                   -1.7636566  0.0593446 -29.719
as.factor(hour)1              -0.0797624  0.0302549  -2.636
as.factor(hour)2              -0.2109837  0.0324117  -6.509
as.factor(hour)3              -0.3977461  0.0352557 -11.282
as.factor(hour)4              -0.6450638  0.0397779 -16.217
as.factor(hour)5              -1.1114147  0.0477636 -23.269
as.factor(hour)6              -1.4047396  0.0556284 -25.252
as.factor(hour)7              -1.8758496  0.0689854 -27.192
as.factor(hour)8              -2.4402610  0.0901966 -27.055
as.factor(hour)9              -2.0536310  0.0748974 -27.419
as.factor(hour)10             -0.8667771  0.0447133 -19.385
as.factor(hour)11             -0.1645916  0.0340834  -4.829
as.factor(hour)12              0.2300467  0.0296700   7.754
as.factor(hour)13              0.2528725  0.0288529   8.764
as.factor(hour)14              0.0528723  0.0297678   1.776
as.factor(hour)15             -0.1145229  0.0307962  -3.719
as.factor(hour)16              0.0581978  0.0297665   1.955
as.factor(hour)17              0.1381832  0.0287088   4.813
as.factor(hour)18              0.2021018  0.0283003   7.141
as.factor(hour)19              0.5857591  0.0316157  18.528
as.factor(hour)20              0.7033996  0.0307799  22.853
as.factor(hour)21              0.8418961  0.0297873  28.264
as.factor(hour)22              0.7389566  0.0299319  24.688
as.factor(hour)23              0.4589391  0.0311745  14.722
dow_simple2                    0.0173347  0.0168996   1.026
dow_simple3                    0.0197736  0.0161964   1.221
dow_simple4                   -0.0851730  0.0167662  -5.080
dow_simple5                   -0.0769256  0.0186471  -4.125
dow_simple6                   -0.0369481  0.0188844  -1.957
dow_simple7                   -0.0211379  0.0174112  -1.214
temp                           0.0082578  0.0007471  11.053
precip                         2.7901668  0.8170372   3.415
days_since_oct1               -0.0046693  0.0004951  -9.432
is_too_cold                    1.6092377  0.2446812   6.577
is_evening                            NA         NA      NA
is_mid_day                            NA         NA      NA
is_morning                            NA         NA      NA
lag_1hr                        0.2100443  0.0077946  26.947
lag_3hr                        0.1913831  0.0092304  20.734
lag_1day                       0.2648014  0.0071568  37.000
days_since_oct1:is_too_cold   -0.0307556  0.0042942  -7.162
is_too_cold:is_evening         0.0082060  0.0607852   0.135
is_too_cold:is_mid_day        -0.6340681  0.3829584  -1.656
is_too_cold:is_morning         0.5058861  0.2382000   2.124
is_evening:pct_public_commute -1.7754217  0.0957888 -18.535
lag_30min:pct_poc              0.4017822  0.0090298  44.495
lag_1hr:pct_poc               -0.3429049  0.0221392 -15.489
pct_poc:lag_1hr30min           0.2671502  0.0103251  25.874
pct_poc:lag_2hr                0.2530428  0.0103360  24.482
pct_poc:lag_2hr30min           0.1297805  0.0110509  11.744
lag_3hr:pct_poc               -0.3280303  0.0260991 -12.569
lag_1day:pct_poc              -0.2046261  0.0195001 -10.494
                                          Pr(>|z|)    
(Intercept)                   < 0.0000000000000002 ***
as.factor(hour)1                          0.008380 ** 
as.factor(hour)2               0.00000000007540499 ***
as.factor(hour)3              < 0.0000000000000002 ***
as.factor(hour)4              < 0.0000000000000002 ***
as.factor(hour)5              < 0.0000000000000002 ***
as.factor(hour)6              < 0.0000000000000002 ***
as.factor(hour)7              < 0.0000000000000002 ***
as.factor(hour)8              < 0.0000000000000002 ***
as.factor(hour)9              < 0.0000000000000002 ***
as.factor(hour)10             < 0.0000000000000002 ***
as.factor(hour)11              0.00000137158800947 ***
as.factor(hour)12              0.00000000000000894 ***
as.factor(hour)13             < 0.0000000000000002 ***
as.factor(hour)14                         0.075707 .  
as.factor(hour)15                         0.000200 ***
as.factor(hour)16                         0.050566 .  
as.factor(hour)17              0.00000148482078556 ***
as.factor(hour)18              0.00000000000092430 ***
as.factor(hour)19             < 0.0000000000000002 ***
as.factor(hour)20             < 0.0000000000000002 ***
as.factor(hour)21             < 0.0000000000000002 ***
as.factor(hour)22             < 0.0000000000000002 ***
as.factor(hour)23             < 0.0000000000000002 ***
dow_simple2                               0.305011    
dow_simple3                               0.222139    
dow_simple4                    0.00000037736818078 ***
dow_simple5                    0.00003701802321018 ***
dow_simple6                               0.050402 .  
dow_simple7                               0.224733    
temp                          < 0.0000000000000002 ***
precip                                    0.000638 ***
days_since_oct1               < 0.0000000000000002 ***
is_too_cold                    0.00000000004804365 ***
is_evening                                      NA    
is_mid_day                                      NA    
is_morning                                      NA    
lag_1hr                       < 0.0000000000000002 ***
lag_3hr                       < 0.0000000000000002 ***
lag_1day                      < 0.0000000000000002 ***
days_since_oct1:is_too_cold    0.00000000000079451 ***
is_too_cold:is_evening                    0.892612    
is_too_cold:is_mid_day                    0.097781 .  
is_too_cold:is_morning                    0.033688 *  
is_evening:pct_public_commute < 0.0000000000000002 ***
lag_30min:pct_poc             < 0.0000000000000002 ***
lag_1hr:pct_poc               < 0.0000000000000002 ***
pct_poc:lag_1hr30min          < 0.0000000000000002 ***
pct_poc:lag_2hr               < 0.0000000000000002 ***
pct_poc:lag_2hr30min          < 0.0000000000000002 ***
lag_3hr:pct_poc               < 0.0000000000000002 ***
lag_1day:pct_poc              < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 176791  on 168964  degrees of freedom
Residual deviance: 132734  on 168916  degrees of freedom
AIC: 206296

Number of Fisher Scoring iterations: 6
Code
# Generate predictions on response scale.
test_clean$pred_poisson <- predict(poisson_model, newdata = test_clean, type = "response")
# Calculate MAE for Poisson model.
mae_poisson <- mean(abs(test_clean$trips - test_clean$pred_poisson), na.rm = TRUE)

# Compare Poisson to OLS.
cat("OLS Updated Model 2 MAE:", round(mae_update, 4), "\n")
OLS Updated Model 2 MAE: 0.227 
Code
cat("Poisson Model MAE:", round(mae_poisson, 4), "\n")
Poisson Model MAE: 0.2254 

The Poisson model has a slightly lower MAE. However, both models are complex, and the sheer number of significant coefficients raises the possibility of overfitting. Note also that the is_evening, is_mid_day, and is_morning main effects are dropped as singularities because they are perfectly collinear with the hour factor.
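Since the Poisson family assumes the variance equals the mean, a quick diagnostic is the Pearson dispersion statistic (sum of squared Pearson residuals over residual degrees of freedom); values well above 1 suggest overdispersion. A minimal sketch on simulated counts rather than the actual trip data:

```r
# Overdispersion check for a Poisson fit: Pearson dispersion statistic.
# Values well above 1 suggest variance exceeds the mean, in which case a
# quasi-Poisson or negative binomial model may be more appropriate.
set.seed(1)
# Simulated overdispersed counts stand in for the real trip data.
x <- rnorm(500)
y <- rnbinom(500, mu = exp(0.5 + 0.3 * x), size = 2)

fit <- glm(y ~ x, family = poisson(link = "log"))
dispersion <- sum(residuals(fit, type = "pearson")^2) / fit$df.residual
dispersion
```

With the real data, the same two-line calculation would be applied to poisson_model.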

Code
# Check OLS model.
# Predict test data.
test_clean$pred_ols_best <- predict(update_model, newdata = test_clean)
# Predict train data.
train_clean$pred_ols_train <- predict(update_model, newdata = train_clean)

mae_test_ols <- mean(abs(test_clean$trips - test_clean$pred_ols_best), na.rm = TRUE)
mae_train_ols <- mean(abs(train_clean$trips - train_clean$pred_ols_train), na.rm = TRUE)

cat("OLS MAE (Test Set, Out-of-Sample):", round(mae_test_ols, 4), "\n")
OLS MAE (Test Set, Out-of-Sample): 0.227 
Code
cat("OLS MAE (Train Set, In-Sample): ", round(mae_train_ols, 4), "\n")
OLS MAE (Train Set, In-Sample):  0.3703 
Code
cat("OLS Difference (Train - Test): ", round(mae_train_ols - mae_test_ols, 4), "\n\n")
OLS Difference (Train - Test):  0.1433 
Code
# Check Poisson model.
# Predict test data.
test_clean$pred_poisson_best <- predict(poisson_model, newdata = test_clean, type = "response")
# Predict train data.
train_clean$pred_poisson_train <- predict(poisson_model, newdata = train_clean, type = "response")

mae_test_poisson <- mean(abs(test_clean$trips - test_clean$pred_poisson_best), na.rm = TRUE)
mae_train_poisson <- mean(abs(train_clean$trips - train_clean$pred_poisson_train), na.rm = TRUE)

cat("Poisson MAE (Test Set, Out-of-Sample):", round(mae_test_poisson, 4), "\n")
Poisson MAE (Test Set, Out-of-Sample): 0.2254 
Code
cat("Poisson MAE (Train Set, In-Sample): ", round(mae_train_poisson, 4), "\n")
Poisson MAE (Train Set, In-Sample):  0.3912 
Code
cat("Poisson Difference (Train - Test): ", round(mae_train_poisson - mae_test_poisson, 4), "\n")
Poisson Difference (Train - Test):  0.1658 

Surprisingly, the out-of-sample (test) MAE is lower than the in-sample (training) MAE for both models, so they generalize to the future period at least as well as they fit the training period.

Code
# Calculate station-level improvement.
station_improvements <- test_clean %>%
  group_by(start_station_id, start_lat, start_lon) %>%
  summarize(
    # MAE for original Model 2.
    mae_model2 = mean(abs(trips - pred2), na.rm = TRUE),
    # MAE for updated model.
    mae_update = mean(abs(trips - pred_update), na.rm = TRUE),
    # Calculate improvement (positive = better).
    improvement = mae_model2 - mae_update,
    .groups = "drop"
  )

# Create map showing improvement by station.
improvement_map <- ggplot() +
  # Plot census tracts as background.
  geom_sf(data = philly_census, fill = "#4c6e52", color = "#f5f4f0", linewidth = 0.5) +
  # Plot stations colored by improvement.
  geom_point(
    data = station_improvements,
    aes(x = start_lon, y = start_lat, color = improvement, size = abs(improvement)),
    alpha = 0.8
  ) +
  # Diverging color scale.
  scale_color_gradient2(
    low = "#0889bc", mid = "#fef7d7", high = "#e83b2e",
    midpoint = 0,
    name = "MAE Change"
  )+
  # Order legend items.
  guides(
    size = guide_legend(order = 1),
    color = guide_colorbar(order = 2)
  ) +
  labs(
    title = "Model Improvement by Indego Station",
    subtitle = "Philadelphia, PA",
    caption = "Positive values indicate better predictions with new features.",
    size = "Absolute Improvement"
  ) +
  theme_map() +
  theme(
    # Stack legend items vertically.
    legend.box = "vertical"
  )

improvement_map

Code
# Calculate percent non-white for analysis.
station_census_lookup$pct_non_yt <- (1 - station_census_lookup$pct_white)

# Join demographics to improvement data.
improvement_demo <- station_improvements %>%
  left_join(
    # Get demographic variables.
    station_census_lookup %>% 
      select(start_station_id, med_inc, pct_public_commute, pct_non_yt), 
    by = "start_station_id"
  ) %>%
  left_join(
    # Get average demand from error analysis.
    station_errors_q4 %>% 
      select(start_station_id, avg_demand),
    by = "start_station_id"
  ) %>%
  # Keep only stations with census data.
  filter(!is.na(med_inc))

# Create scatter plot of improvement vs race.
mae_poc_improvement_plot <- ggplot(improvement_demo, aes(x = pct_non_yt, y = improvement)) +
  # Size points by average demand.
  geom_point(aes(size = avg_demand), alpha = 0.5, color = "#EF4269") +
  # Horizontal line at zero improvement.
  geom_hline(yintercept = 0, linetype = "dashed", color = "#4269EF", linewidth = 1) +
  # Trend line with confidence band.
  geom_smooth(method = "lm", se = TRUE, color = "#2d2a26", linetype = "longdash", linewidth = 1) +
  # Format x-axis as percentage.
  scale_x_continuous(labels = scales::percent) +
  labs(
    title = "Model Improvement by Station",
    subtitle = "% Non-White Population",
    x = "% Non-White",
    y = "MAE Improvement (Trips per 30-Min)",
    size = "Average Demand"
  ) +
  theme_plot()

# Create scatter plot of improvement vs income.
mae_inc_improvement_plot <- ggplot(improvement_demo, aes(x = med_inc, y = improvement)) +
  geom_point(aes(size = avg_demand), alpha = 0.5, color = "#EF4269") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "#4269EF", linewidth = 1) +
  geom_smooth(method = "lm", se = TRUE, color = "#2d2a26", linetype = "longdash", linewidth = 1) +
  scale_x_continuous(labels = scales::dollar) +
  labs(
    title = "Model Improvement by Station",
    subtitle = "Median Income",
    x = "Median Income",
    y = "MAE Improvement (Trips per 30-Min)",
    size = "Average Demand"
  ) +
  theme_plot()

mae_pub_improvement_plot <- ggplot(improvement_demo, aes(x = pct_public_commute, y = improvement)) +
  geom_point(aes(size = avg_demand), alpha = 0.5, color = "#EF4269") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "#4269EF", linewidth = 1) +
  geom_smooth(method = "lm", se = TRUE, color = "#2d2a26", linetype = "longdash", linewidth = 1) +
  scale_x_continuous(labels = scales::percent) +
  labs(
    title = "Model Improvement by Station",
    subtitle = "% Public Transit Commuters",
    caption = "Points above the dashed line show improvement (MAE decrease).\nChange in MAE (Old MAE - New MAE) vs. % Public Transit Commuters",
    x = "% Public Transit Commuters",
    y = "MAE Improvement (Trips per 30-Min)",
    size = "Average Demand"
  ) +
  theme_plot()

# Stack vertically.
mae_poc_improvement_plot / mae_inc_improvement_plot / mae_pub_improvement_plot

Code
# Filter improvement data for stations where improvement is negative.
worse_stations <- improvement_demo %>%
  filter(improvement <= 0) %>%
  # Sort by magnitude of negative improvement.
  arrange(improvement) %>% 
  select(
    start_station_id,
    improvement_mae = improvement,
    avg_demand = avg_demand,
    med_inc = med_inc,
    pct_poc = pct_non_yt,
    pct_public_commute = pct_public_commute
  )

# Kable.
worse_stations_kable <- worse_stations %>%
  mutate(
    # Formatting.
    improvement_mae = round(improvement_mae, 4),
    avg_demand = round(avg_demand, 2),
    med_inc = scales::dollar(med_inc),
    pct_poc = scales::percent(pct_poc, accuracy = 1),
    pct_public_commute = scales::percent(pct_public_commute, accuracy = 1)
  ) %>%
  kable(
    caption = "Updated Model: Stations Where Feature Engineering Worsened MAE",
    col.names = c(
      "Station ID", "MAE Change", "Average Demand", "Median Income", 
      "% Non-White", "% Public Transit"
    )
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )

worse_stations_kable
Updated Model: Stations Where Feature Engineering Worsened MAE
Station ID MAE Change Average Demand Median Income % Non-White % Public Transit
3064 -0.0155 0.20 $116,354 53% 15%
3363 -0.0069 0.29 $61,850 39% 6%
3028 -0.0023 0.45 $144,954 11% 14%
3165 -0.0002 0.16 $102,330 38% 11%
Code
# Map to see where stations worsened.
worsened_stations_map_data <- station_improvements %>%
  filter(improvement < 0)

worsened_map <- ggplot() +
  # Census tracts.
  geom_sf(data = philly_census, fill = "#4c6e52", color = "#f5f4f0", linewidth = 0.5) +
  # Stations with negative improvement.
  geom_point(
    data = worsened_stations_map_data,
    aes(x = start_lon, y = start_lat, 
        # Color by the negative improvement.
        color = improvement, 
        # Size by the absolute error increase.
        size = abs(improvement)),
    alpha = 0.9
  ) +
  scale_color_gradient(
    low = "#e83b2e",
    high = "#990000",
    name = "MAE Increase (Trips per 30-Min)"
  ) +
  guides(
    size = guide_legend(order = 1),
    color = guide_colorbar(order = 2)
  ) +
  labs(
    title = "Indego Stations with Worsened Predictions",
    subtitle = "Philadelphia, PA",
    caption = "Points show where updated model caused MAE to increase.\nSize indicates the magnitude of the MAE increase.",
    size = "Absolute Error Increase"
  ) +
  theme_map() +
  theme(legend.box = "vertical")

worsened_map

Code
# Calculate final error based on updated model prediction (pred_update).
test_q4_error_final <- test_clean %>%
    mutate(
        error_final = trips - pred_update,
        # Create numeric 0-47 index for plot.
        time_slot_numeric = hour(interval30) * 2 + minute(interval30) / 30 
    )

# Calculate bias based on 30-min time.
temporal_bias_final <- test_q4_error_final %>%
  group_by(time_slot_numeric, is_weekend) %>%
  summarize(
    mean_error = mean(error_final, na.rm = TRUE),
    .groups = "drop"
    ) %>%
  mutate(day_type = ifelse(is_weekend, "Weekend", "Weekday"))

# Bias line plot.
bias_plot <- ggplot(temporal_bias_final, aes(x = time_slot_numeric, y = mean_error, color = day_type)) +
  # Add reference line at zero.
  geom_hline(yintercept = 0, linetype = "dashed", color = "#f48f33", linewidth = 1) +
  # Line with points.
  geom_line(linewidth = 1.5) +
  geom_point(size = 3) +
  scale_color_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
  scale_x_continuous(breaks = seq(0, 48, by = 4), labels = seq(0, 24, by = 2)) +
  labs(
    title = "Prediction Bias by 30-Minute Intervals and Day Type",
    subtitle = "2024 Q4 Test Set",
    caption = "Positive values mean the model under-predicts (demand > prediction).\nMean Error (Actual Trips - Predicted Trips)",
    x = "Hour",
    y = "Mean Error (Trips per 30-Min)",
    color = "Day Type"
    ) +
  theme_plot()

bias_plot

For the updated model, two variables, days_since_oct1 and is_too_cold, were created and added. The first captures the earlier observation that ridership declines heading deeper into the winter season; the second flags very cold hours (below 37°F) that may deter riders.

Then, flags for evening and mid-day hours were created, covering 7:00 PM to 11:00 PM and 10:00 AM to 3:00 PM, respectively (along with a morning flag for 7:00 AM to 9:00 AM). These were added because, even with the features above, systematic under-prediction remained visible during those times.

Another interaction included was is_evening with pct_public_commute: on the way home from work, people may prefer public transit because they are tired or want shelter from the weather.

What ultimately raised the model's improvement to 9.51% in MAE were the lag interactions with the created pct_poc (people of color) variable. The logic came from personal experience: non-white individuals, whether US-born or foreign, often retain cultural habits and differing time perception, especially since time is communicated differently across languages.

It is also important to note that these interactions were specified as interaction terms only, without their main effects (R's : rather than *); I was explicit about not including race itself as a standalone predictor, as doing so actually worsened the model.
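For readers less familiar with R formula syntax: a * b expands to main effects plus the interaction, while a : b adds only the product term, which is how the pct_poc interactions enter without a race main effect. A minimal illustration on toy data:

```r
# `a * b` expands to a + b + a:b; `a : b` keeps only the product term.
d <- data.frame(y = rnorm(20), a = rnorm(20), b = rnorm(20))

names(coef(lm(y ~ a * b, data = d)))
# "(Intercept)" "a" "b" "a:b"
names(coef(lm(y ~ a : b, data = d)))
# "(Intercept)" "a:b"
```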

New terms and interactions:

  • is_too_cold
  • is_evening
  • is_mid_day
  • is_morning
  • pct_public_commute
  • pct_poc
  • (is_too_cold * is_evening)
  • (is_too_cold * is_mid_day)
  • (is_too_cold * is_morning)
  • (pct_public_commute : is_evening)
  • (lag_30min : pct_poc)
  • (lag_1hr : pct_poc)
  • (lag_1hr30min : pct_poc)
  • (lag_2hr : pct_poc)
  • (lag_2hr30min : pct_poc)
  • (lag_3hr : pct_poc)
  • (lag_1day : pct_poc)

In the model improvement map, the vast majority of stations show MAE improvements, though the map is less granular than the scatter plots, which examine how stations changed across demographics: % non-white, median income, and % public transit commuters. Four stations saw their MAE increase (worsen) or stay essentially flat, and several others show very marginal improvements, sitting just barely above the line.

Looking at prediction bias across hours and day types, the updated model still systematically underestimates the morning-commute demand surge, even after features were added specifically to target those hours.

According to Indego's station list, the worsened stations are in and around the Center City area and are currently active:

  • 3064, 18th & Washington, Chew Playground
  • 3363, 13th & Spruce
  • 3028, 4th & Bainbridge
  • 3165, 24th & Race SRT

These areas generally have median incomes above $100,000, except for station 3363 at around $60,000.

PART IV: CRITICAL REFLECTION

1. Operational Implications

The final MAE looks "good enough" for Indego's use. An error of 0.23 trips per 30-minute interval means the model is off by roughly one trip every two hours at each station, and an error rate this low could help Indego staff better manage station inventory and bike availability.

The morning prediction bias noted above is critical, since under-prediction there means bikes will run out. There is also a mid-day dip in mean error between 9:00 AM and 3:00 PM, bottoming out around noon: the model predicts a large, sustained mid-day usage that does not actually materialize. The same occurs around 9:00 PM, the lowest dip in the plot, where the model misses the evening shutdown. This is also critical, because bikes then accumulate at stations in the evening.

I would recommend deploying this system as an aid, not an end-all-be-all solution; even with a narrow MAE, the error is not zero.

2. Equity Considerations

The initial model does show a socioeconomic bias, as described at the end of the previous section, so it is important to deploy guardrails, since the highest MAE remains in Center City. Otherwise, resource allocation could divert limited resources away from lower-demand, transit-dependent neighborhoods where the model claims to be accurate, disproportionately affecting communities with less mobility. One guardrail could be assigning different weights to communities that are less mobile.
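One way to implement such a guardrail is an evaluation metric that up-weights errors in transit-dependent tracts, so accuracy losses there are penalized more heavily. A hypothetical sketch; the column names are illustrative, not from the pipeline above:

```r
# Hypothetical: weight station-level absolute errors by transit
# dependence so errors in transit-reliant tracts count more.
stations <- data.frame(
  abs_error          = c(0.20, 0.50, 0.10),
  pct_public_commute = c(0.40, 0.10, 0.25)
)
# Normalize weights to sum to 1.
w <- stations$pct_public_commute / sum(stations$pct_public_commute)
weighted_mae   <- sum(w * stations$abs_error)
unweighted_mae <- mean(stations$abs_error)
c(weighted = weighted_mae, unweighted = unweighted_mae)
```

Here the weighted MAE is pulled toward the low-error, high-transit-dependence station, illustrating how the metric rewards accuracy where it matters most for equity.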

To add more context to the POC and lag interactions: many online and offline American spaces reference "Asian time," "Black people time," "Mexican time," "CPT (colored people time)," etc. within ethnic and/or racial communities.

Some of these terms have been used pejoratively; CPT in particular has an etymology rooted in racism and anti-Blackness in the Antebellum South, where it was weaponized by white slaveholders to further permit cruelty and dehumanization.

While these terms were used negatively to prescribe laziness to Black individuals (and later to other groups as the US diversified), they describe real habits that are mostly judged, and constrained, against norms of Western punctuality.

3. Model Limitations

Another helpful method could be to try Negative Binomial and compare with the Poisson model’s MAE.

This model also implicitly assumes stations are stocked to capacity with available bikes, but real-world deployment depends on human behavior and urban variability, so additional data on real-time availability (which does exist in the Indego app) and inventory would help the predictive model.

Spatial proximity to bike lanes could also be helpful, although I am slightly skeptical it could improve MAE drastically as a lot of bikers do ride on sidewalks, but that is a separate conversation because it may mean a lack of safe biking infrastructure.

The earlier assumption that a more granular temporal resolution would improve prediction also appears correct, since many events and workdays start on 30-minute marks. I am unsure about 15-minute granularity, but it couldn't hurt to test whether it brings further improvement.
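Moving to 15-minute bins would mostly mean changing the interval in the binning step, e.g. with lubridate::floor_date; the timestamps below are illustrative:

```r
library(lubridate)

# Bin example timestamps into 15-minute intervals.
times <- ymd_hms(c("2024-10-01 08:07:00", "2024-10-01 08:22:00"))
floor_date(times, "15 minutes")
# "2024-10-01 08:00:00 UTC" "2024-10-01 08:15:00 UTC"
```

The rest of the pipeline (grouping, lags, model formula) would then operate on the finer interval column instead of interval30.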

NOTE: Credit and AI Usage

Base code template provided by Dr. Delmelle with adjustments changing it from hourly analysis to 30-minute interval analysis.

Free version of Claude Sonnet 4.5 used for debugging code, especially the final panel data portion.

Visuals taken from previous labs and lecture examples and modified.